UCL LIBRARY SERVICES

Generative AI and library skills

This guide explores the use of AI and generative AI in the context of the library research process

GenAI training data and copyright infringement

If GenAI models are trained on materials protected by copyright without the permission of the copyright owner, does this mean that they could be violating copyright laws?

The answer to this is complicated.

Unless copyright has expired or material has already been licensed under terms that allow reuse, e.g. under the Creative Commons Attribution licence, normally permission is required to reproduce, share and reuse material. Reproducing and sharing material without permission or a licence from the copyright owner could be unlawful.

However, in certain cases materials may be reused without permission, for specific purposes defined in the law (‘copyright exceptions’ or ‘permitted acts’) or if certain criteria are met, e.g., ‘fair use’ in the US. Training GenAI models may rely on these exceptions. Since copyright laws vary across countries, it is also crucial where the training activity took place.

Using copyrighted materials without permission to train GenAI could, therefore, be perceived as unlawful, or it could be deemed to be permitted under an exception, e.g. the text and data mining exception in the UK, or considered ‘fair’. This is being decided in relevant court cases, whose outcomes will help shape how copyright applies to GenAI.

Ongoing court cases

A number of copyright owners have sued AI companies for copyright infringement. Key cases include:

  • Getty Images vs Stability AI: Getty Images is suing Stability AI for using millions of its stock images to train its AI models without permission. Getty argues that using the images to train the models involved unauthorised copying of the images, false endorsement and the generation of images that incorporate parts of the originals. The outcomes of the case should help shed light on whether copying took place in the first instance, whether the use of the images could be justified by a copyright exception, e.g. ‘fair use’ in the US or pastiche in the UK, and whether any generated outputs that look similar to the Getty images have in fact reproduced parts of the originals. You can read more about the case on the Gilbert and Tobin website.
  • New York Times vs OpenAI and Microsoft: The New York Times claims that millions of its subscription articles were used without permission to train ChatGPT and Microsoft Copilot and to produce derivatives that compete with the original work, causing financial and reputational damage to the NYT. Crucially, the New York Times has provided evidence that the GenAI outputs can be exact reproductions of the original articles. OpenAI is treating this as a technical issue that can be fixed, and has further responded by developing opt-out mechanisms for authors of the original works. The outcome of this case should provide a steer on what is considered ‘infringement’ and ‘fair use’ in GenAI training and may also forge a distinction between possible infringement at the training stage and infringement when providing prompts. An interesting discussion of this case can be found in The Conversation.
  • Authors’ Guild vs OpenAI and Microsoft: The Authors’ Guild and several individual authors, including John Grisham, Jodi Picoult, George R.R. Martin and Michael Connelly, are suing OpenAI and Microsoft for unlawfully accessing their books from pirate sites and using them to train their models and produce derivative outputs that compete with the market for the original books. Similar cases have been brought separately by individual authors, again raising issues of infringement and the applicability of copyright exceptions. You can read a nuanced discussion of these issues on the TechnoLlama website.
  • Doe 1 v. GitHub: In 2022 several code developers sued GitHub and OpenAI over GitHub Copilot, alleging copyright infringement, violation of the terms of open source licences and breach of contract. While the court dismissed most of these claims, the case is interesting because it highlights the issue of attribution: even when a work is available under an open licence and may therefore be reused, authors expect the terms of the licence, including attribution, to be respected.

These cases highlight the complexity of copyright as it applies to GenAI, but they also raise broader questions:

  • What constitutes copying in GenAI training?
  • How can it be evidenced that a GenAI output is the result of copying substantial parts from the source material?
  • Even if reproduction took place, could it be justified by a copyright exception or ‘fair use’?
  • How can issues such as attribution be addressed?
  • Importantly, if it is found that an AI company did infringe copyright, does this mean that users of the tool could be liable for copyright infringement, too?

While these questions are being addressed in the courts, users are advised to be mindful of these issues and to use tools that demonstrate some degree of transparency in the way they work and in their terms and conditions.

It is also worth noting that, while the outcomes of these cases will certainly be informative, copyright considerations in academic settings are very likely to differ from the criteria and considerations applied in the creative industries and commercial settings. See related commentary on fairness criteria on the SPARC website, particularly point 3.

The text and data mining (TDM) exception in the UK: relevance to GenAI

UK legislation includes a copyright exception allowing copying for the purposes of computational analysis of text and data, as long as the use is non-commercial, the user has lawful access to the materials and the sources are acknowledged (unless it is impossible to do so for practical reasons). For more detail on the exception see our TDM guidance.

The question is whether the exception could be applied to training GenAI models. This is important if your research involves developing or training a GenAI model. A recent court ruling in Germany (Kneschke v LAION), a jurisdiction with similar exceptions including one covering TDM for research purposes, should help shed light on this. The photographer and copyright owner of an image sued the LAION organisation for copying the image without permission for the purposes of creating a dataset to support AI training. The case is quite complex; full details are discussed on the Kluwer Copyright Blog and the TechnoLlama website. Here we highlight the relevance of the court’s decision, which (a) confirmed that making a copy of an image in order to extract information from it is covered by the exception and (b) found that the activity was non-commercial research. Although the decision did not cover the further use of the dataset to train the model, comments by the judge suggest that TDM exceptions could extend to AI training.

In an academic setting, asserting the right to rely on the TDM exception to train AI in research is important. Some publishers may have clauses in their terms of use that preclude the use of their articles for AI purposes. This is being challenged; please see relevant guidance by JISC.

If your research involves TDM and you are unsure about publishers’ clauses or encounter technological barriers when copying the data, please contact us for advice.

Openly licensed training data and the issue of attribution

Copyright breaches can still happen even if the training data and prompts are shared under a licence allowing reuse, such as a Creative Commons licence or an open source software licence.

If GenAI activities rely on a licence, the terms of that licence must be respected. This includes attributing the author and meeting any specific terms of the licence, for example no-derivatives, share-alike and non-commercial restrictions. These points need to be addressed both if you are creating your own model and if your work is being used to train AI models. Creative Commons has a useful article and flowchart showing which terms of the licences apply in different cases of GenAI activity.

Attribution is, of course, a requirement of all six CC licences; attribution is also expected for materials that are not openly licensed, as part of good academic practice and research integrity, and as part of fair dealing when relying on exceptions. There are concerns that GenAI outputs do not attribute their sources or, where they do, that attributions can be inaccurate or fabricated altogether.

Solutions to this might involve a combination of approaches:

  • Technical advances, for example retrieval-augmented generation techniques that aim to improve the accuracy of the models’ responses and the veracity of the sources they cite.
  • Legal requirements for better transparency in disclosing training data sets, as is the case with the EU AI Act.
  • Audit projects such as the Data Provenance Initiative, which aims to increase the transparency of training data sets through analysing provenance and licensing terms.

For an extensive discussion of infringement and attribution issues in GenAI, see Johnson A. Generative AI, UK Copyright and Open Licences: considerations for UK HEI copyright advice services [version 1; peer review: 2 approved]. F1000Research 2024, 13:134 (https://doi.org/10.12688/f1000research.143131.1).

Prompts, copyright infringement and liabilities

As a user of GenAI tools, you will be providing prompts in the form of text, images, code, film etc. You could be breaching copyright if your prompts are someone else’s intellectual property and you don’t have permission or a licence to share them with a third party. This may include, for example, articles that UCL subscribes to which are provided for personal research and study or images for which you do not own the copyright.

A highly publicised case reflecting this involves Tesla’s use of a still from the film Blade Runner 2049 without permission in October 2024. Tesla first approached Alcon Entertainment LLC, the producer of the film, to ask for permission to reuse the image. When this was denied, Tesla used the image as a prompt in a GenAI tool to generate a new version, which was shared as part of a promotional event. The outcome of the Alcon vs Tesla case should also provide insight into infringement in the context of GenAI.

You could also be infringing copyright if your generated output reproduces substantial parts of original content that is protected by copyright and not licensed for reuse. Several AI tools, usually in their paid versions, offer indemnities to cover legal expenses in the event of a user being sued for copyright infringement. However, these indemnities are limited and unlikely to offer comprehensive cover. More advice on indemnities and their limitations can be found on the Farrer & Co website.