Image by Davide Bonazzi at www.copyrightuser.org/understand/exceptions/text-data-mining/
There are various definitions of TDM which cover both the technicalities and utilities of the practice. The UK Intellectual Property Office (IPO) define TDM as: ‘The use of automated analytical techniques to analyse text and data for patterns, trends and other useful information’. Within TDM, text mining and data mining are also defined separately. Text mining is usually understood as the computational process of discovering and extracting knowledge from unstructured data, such as the free text of articles. Data mining, on the other hand, is the computational process of discovering and extracting knowledge from structured data, such as records with defined fields.
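To make the distinction concrete, here is a minimal sketch in Python. The sample text, the sample records and the function names are all invented for illustration. Real TDM projects would use much larger corpora and more sophisticated methods.

```python
import re
from collections import Counter

# Unstructured input: free text with no predefined fields (invented sample).
abstract = "Mining text reveals trends. Mining data reveals patterns in data."

# Structured input: records with named fields (invented sample).
records = [
    {"year": 2019, "journal": "A"},
    {"year": 2020, "journal": "B"},
    {"year": 2020, "journal": "A"},
]


def word_counts(text):
    """Text mining step: split free text into words and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def counts_by_field(rows, field):
    """Data mining step: count structured records by a named field."""
    return Counter(row[field] for row in rows)


if __name__ == "__main__":
    print(word_counts(abstract).most_common(3))
    print(counts_by_field(records, "year"))
```

In the first function the structure (words) has to be derived from the text before anything can be counted. In the second, the fields already exist and can be aggregated directly.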
There has been a surge of interest in the use of TDM in academia across all disciplines ranging from the sciences to the humanities. However, TDM entails a range of legal and political issues which need to be considered, primarily centred around copyright, intellectual property rights, licences and download limits.
TDM can make research easier for those seeking to examine a large corpus of documents in order to discover underlying trends across multiple datasets. It is often cited as a way of accelerating scientific discovery, but it is also useful for researchers in the humanities who want to mine sources like journals and newspapers.
The Advanced Research Computing Centre (ARC) team at UCL work closely with a number of departments around UCL by collaborating on a range of software projects including Oceanic Exchanges, ForecastCC and the UCL-wide CloudLabs.
In recent years, some changes have been made to the UK’s intellectual property framework in order to support innovation and growth. The Hargreaves Review (2011) recommended a copyright exception, which was introduced into UK law in 2014, to allow the use of text and data analytics for non-commercial research. Yet many barriers to TDM remain. Some of these have been studied in more detail by Michelle Brook, Peter Murray-Rust and Charles Oppenheim, who argue that a number of non-technological barriers need to be overcome in order to realise the full potential of TDM. They raise concerns about the legal issues surrounding copyright law and database rights, but also offer guidelines on how publishers can help to overcome these barriers to research. These include giving researchers lawful access to original materials and making clear distinctions between research that is regarded as ‘commercial’ and research that is ‘non-commercial’.
While a specific licence is not generally required for academic TDM, many publishers provide explicit support for TDM by academic users through a specific interface. This allows higher rates of access, and avoids problems that can come from intensively crawling the publisher sites.
For example, in the area of scholarly journals, Elsevier, Springer and Wiley all provide access to journal content for TDM through a dedicated Application Programming Interface (API). There is no 'one-stop shop' for TDM across multiple providers, but depending on the material you are looking for, it may be possible to use the Crossref, Scopus, or Web of Science APIs to get some initial data. If you are considering mining content from a provider who does not offer a specific API, and you will be downloading a large number of records, it is recommended that you consult with them first.
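Where bulk downloading has been agreed with a provider, requests are usually spaced out rather than sent as fast as possible. The following is a minimal sketch of that idea in Python. The function name `fetch_politely`, the one-second default delay and the `fetch` callable are assumptions made for illustration. Actual limits should come from the provider's terms or API documentation.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content.
    It is hypothetical here; supply one suited to the provider you use.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic separate from any particular provider and makes it easy to try out without touching a live site.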
There are a number of ways in which the library can assist researchers with TDM:
APIs, or Application Programming Interfaces, are tools used to share data between software applications. They are used in a variety of contexts, including embedding content from one website in another, dynamically posting content from one application for display in another, or extracting data from a database in a more programmatic way than a regular user interface might allow, such as bulk collection for text mining.
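As an illustration of programmatic access, the sketch below queries the public Crossref REST API for works matching a search term and keeps the DOI and first title of each result. It assumes the `https://api.crossref.org/works` endpoint with its `query`, `rows` and `mailto` parameters, and a JSON response with results under `message.items`. The function names are invented here, and the current Crossref documentation should be checked before relying on any of these details.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public Crossref REST API.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5, mailto=None):
    """Build a search URL; `mailto` identifies you to the service."""
    params = {"query": query, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urlencode(params)


def extract_records(payload):
    """Keep the DOI and first title from a parsed Crossref-style response."""
    items = payload.get("message", {}).get("items", [])
    return [
        {"doi": item.get("DOI"), "title": (item.get("title") or [None])[0]}
        for item in items
    ]


def search(query, rows=5, mailto=None):
    """Send the request over the network and return simplified records."""
    with urlopen(build_query_url(query, rows, mailto)) as response:
        return extract_records(json.load(response))


if __name__ == "__main__":
    for record in search("text mining", rows=3):
        print(record["doi"], record["title"])
```

The same request made through a web search page would return results formatted for reading. The API returns them as structured data that a script can collect and analyse in bulk.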
New to text mining? Gale Digital Scholar Lab is a cloud-based research and learning platform that allows students and researchers to apply natural language processing tools to raw text data (OCR text) from Gale’s primary source archives, as well as to their own text files from outside these collections, all without the need for specialised programming skills.