Text and data mining (TDM)

Image by Davide Bonazzi at www.copyrightuser.org/understand/exceptions/text-data-mining/

There are various definitions of TDM, covering both the technicalities and the uses of the practice. The UK Intellectual Property Office (IPO) defines TDM as: ‘The use of automated analytical techniques to analyse text and data for patterns, trends and other useful information’. Within TDM, text mining and data mining are themselves defined differently. Text mining is most commonly seen as the computational process of discovering and extracting knowledge from unstructured data, such as the free text of articles. Data mining, on the other hand, is the computational process of discovering and extracting knowledge from structured data, such as records in a database.
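As a rough illustration of the distinction, the following Python sketch counts word frequencies in a short piece of free text (a very simple form of text mining) and averages a numeric field across structured records (a very simple form of data mining). The function names, the stopword list and the sample data are invented for illustration and are not taken from any particular TDM tool.

```python
import re
from collections import Counter

# Illustrative stopword list only; real projects use much fuller lists.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "to", "is"})


def term_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword token appears in free text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


def mean_by_key(records, key, value):
    """Average a numeric field per group in a list of structured records."""
    totals, counts = {}, {}
    for row in records:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
        counts[row[key]] = counts.get(row[key], 0) + 1
    return {k: totals[k] / counts[k] for k in totals}


# Unstructured input: free text with no predefined fields (text mining).
abstract = "Mining of text reveals trends. Text mining scales to large corpora."

# Structured input: records with known, named fields (data mining).
loans = [
    {"subject": "history", "loans": 10},
    {"subject": "history", "loans": 20},
    {"subject": "physics", "loans": 6},
]

word_counts = term_frequencies(abstract)
average_loans = mean_by_key(loans, "subject", "loans")
```

In practice both steps would be far more involved, but the contrast holds: the text has to be broken into units before anything can be counted, whereas the structured records can be grouped and summarised directly.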

There has been a surge of interest in the use of TDM in academia across all disciplines ranging from the sciences to the humanities. However, TDM entails a range of legal and political issues which need to be considered, primarily centred around copyright, intellectual property rights, licences and download limits. 

Why do TDM?

TDM can make research easier for those who need to examine a large corpus of documents in order to discover underlying trends across multiple datasets. It is often cited as a way of accelerating scientific discovery, but it is equally useful for researchers in the humanities, who can mine sources such as journals and newspapers.

The Advanced Research Computing Centre (ARC) team at UCL work closely with departments across the university, collaborating on a range of software projects including Oceanic Exchanges, ForecastCC and the UCL-wide CloudLabs.

Barriers to TDM

In recent years, some changes have been made to the UK’s intellectual property framework in order to support innovation and growth. The Hargreaves Report (2011) recommended a copyright exception to allow the use of analytics for non-commercial research, and this exception was introduced into UK law in 2014. Even so, many barriers to TDM remain. Michelle Brook, Peter Murray-Rust and Charles Oppenheim have studied some of these in more detail, arguing that a number of non-technological barriers need to be overcome in order to realise the full potential of TDM. They raise concerns about the legal issues surrounding copyright law and database rights, and offer guidelines on how publishers can help to overcome these barriers to research, for example by giving researchers lawful access to original materials and by distinguishing clearly between ‘commercial’ and ‘non-commercial’ research.

Accessing material for TDM

While a specific licence is not generally required for academic TDM, many publishers provide explicit support for TDM by academic users through a dedicated interface. This allows higher rates of access and avoids the problems that can come from intensively crawling publishers' websites.

For example, in the area of scholarly journals, Elsevier, Springer and Wiley all provide access to journal content for TDM through a dedicated Application Programming Interface (API). There is no 'one-stop shop' for TDM across multiple providers, but depending on the material you are looking for, it may be possible to use the Crossref, Scopus, or Web of Science APIs to get some initial data. If you are considering TDM work with content from a provider who does not offer a specific API, and you will be downloading a large number of records, it is recommended that you consult with them first.
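A minimal sketch of what getting some initial data might look like, assuming the public Crossref REST API at api.crossref.org and its /works endpoint with the query, rows and mailto parameters. The helper names below are hypothetical, only the DOI and title fields are read from each result, and the provider's current documentation and terms of use should be checked before running anything at scale.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Crossref REST API "works" route.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(terms, rows=5, mailto=None):
    """Build a search URL; mailto optionally identifies the requester."""
    params = {"query": terms, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def extract_records(payload):
    """Pull the DOI and first title out of each item in a decoded response."""
    items = payload.get("message", {}).get("items", [])
    return [
        {"doi": item.get("DOI"), "title": (item.get("title") or [None])[0]}
        for item in items
    ]


def search_crossref(terms, rows=5, mailto=None):
    """Send one search request and return a list of simplified records."""
    url = build_query_url(terms, rows, mailto)
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_records(json.load(response))
```

Calling search_crossref("text mining", rows=5) sends a single request over the network. The URL building and the response parsing are kept in separate functions so that each can be checked without contacting the service.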

UCL's API availability by publisher
An Excel spreadsheet listing UCL's access to publisher APIs. Please contact us (lib-eresource-help@ucl.ac.uk) if you require further information, including for publishers not listed here.

How can Library Services help?

There are a number of ways in which the library can assist researchers with TDM:

Provide advice on the tools available to undertake TDM alongside the type of sources you may wish to consider analysing.
Refer researchers to other specialists who can assist further with the technicalities and legalities of TDM.
Promote and build TDM networks across the university and beyond.

Further resources 

What's an API?

APIs, or Application Programming Interfaces, are tools used to share data between software applications. They are used in a variety of contexts, including embedding content from one website in another, dynamically posting content from one application for display in another, or extracting data from a database in a more programmatic way than a regular user interface might allow, such as bulk collection for text mining.
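Bulk collection through an API usually means requesting results one page at a time while keeping within the provider's download limits. The sketch below shows one way to pace such requests. It is an illustration under stated assumptions rather than any provider's recommended client: collect_pages and fetch_page are hypothetical names, and the one-second default interval is a placeholder, not a published limit.

```python
import time


def collect_pages(fetch_page, max_pages, min_interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch_page(0), fetch_page(1), ... and gather the results.

    Waits so that consecutive calls start at least min_interval seconds
    apart, and stops early when a page comes back empty.
    """
    results = []
    last_call = None
    for page in range(max_pages):
        if last_call is not None:
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)  # pause to respect the chosen request rate
        last_call = clock()
        batch = fetch_page(page)
        if not batch:
            break  # an empty page is treated as the end of the results
        results.extend(batch)
    return results
```

Here fetch_page stands for whatever function actually requests one page of records from a given provider. The sleep and clock parameters exist only so that the pacing logic can be tested without real delays.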

Gale Digital Scholar Lab

New to text mining? Gale Digital Scholar Lab is a cloud-based research and learning platform that allows students and researchers to apply natural language processing tools to raw text data (OCR text) from Gale’s primary source archives, as well as to their own text files from outside these collections, all without the need for specialised programming skills.