Copyright in AI training: remuneration right for creators and licence obligation for providers

There has long been a global debate between copyright holders and producers of AI systems about whether, and if so under what circumstances, copyright is infringed when content is scraped and used to train generative AI such as Large Language Models. Most proceedings are ongoing in the US, where Anthropic recently agreed to the largest settlement ($1.5 billion) with rights holders in a copyright case. In an earlier blog, we wrote about a case currently before the Court of Justice of the European Union (CJEU). Those proceedings also raise the question of the extent to which Google was entitled to use copyrighted content, without the copyright holder's permission, to train its AI chatbot Gemini. The CJEU's answer to those questions is not expected until 2027.

In the meantime, it remains uncertain whether the text and data mining exception applies to copyrighted content used to train AI systems. There is, however, a judgment of the Amsterdam District Court from October 2024 which summarily ruled that HowardsHome, a party creating short alerts based on copyrighted articles, could invoke the text and data mining exception because the copyright holders had not expressly reserved text and data mining on their website.

Now a German court has ruled in a similar matter on whether OpenAI could rely on the text and data mining exception of the Digital Single Market Directive ('DSM Directive') to the extent that copyrighted song lyrics are included in the language models of OpenAI's AI systems and in the output of those AI systems. The case was brought by GEMA (the German sister organisation of BUMA STEMRA), which represents music authors. In its judgment, the court gave detailed reasons why, in its view, OpenAI cannot invoke the text and data mining exception of the DSM Directive and why the reproduction of song lyrics in the language models and in the output of OpenAI's AI systems constitutes copyright infringement.

In its judgment, the German court discusses the technique used to train AI systems quite extensively. For instance, the judge considers that it is known from research that training data can be included in language models and extracted as output. This process is called memorisation. Memorisation occurs when, during training, the language models not only extract information from the training dataset but also incorporate the training data in full into the parameters that are fixed after training. The German court noted that such memorisation in OpenAI's AI systems was established by comparing the song lyrics included in the training data with their representations in the outputs. Given the complexity and length of the lyrics, coincidence can be ruled out as the cause of the lyrics being displayed. The court concluded that the song lyrics were indirectly observable and therefore reproducibly recorded in the language models, which is not allowed without permission from the copyright holders.
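To make the comparison between training text and model output more concrete, here is a minimal Python sketch. It is not the method used by the court or by GEMA, which the judgment summary above does not describe in detail. It only illustrates one simple way to measure verbatim overlap, using the standard library's difflib. The function names and the example texts are hypothetical, and the "lyric" is an invented placeholder, not a real song.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, output: str) -> str:
    """Return the longest run of words that appears verbatim in both texts."""
    ref_words = reference.lower().split()
    out_words = output.lower().split()
    # Compare word sequences rather than characters; autojunk is disabled
    # so that frequent words are not silently ignored.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(out_words))
    return " ".join(ref_words[match.a : match.a + match.size])


def overlap_ratio(reference: str, output: str) -> float:
    """Share of the reference's words covered by the longest shared run."""
    ref_len = len(reference.split())
    if ref_len == 0:
        return 0.0
    return len(longest_shared_run(reference, output).split()) / ref_len


# Invented placeholder text standing in for a protected lyric (assumption).
reference = "the river runs slow under the silver moon tonight"
# Hypothetical model output that repeats the text verbatim.
output = "Sure, here it is: the river runs slow under the silver moon tonight and so on"
# Hypothetical unrelated output for contrast.
unrelated = "a completely different sentence about the weather"

print(overlap_ratio(reference, output))
print(overlap_ratio(reference, unrelated))
```

The longer and more distinctive the reference text, the less plausible it is that a long verbatim run appears in the output by chance, which mirrors the court's reasoning about the complexity and length of the lyrics. A real assessment would need to handle punctuation, paraphrase, and partial matches more carefully than this sketch does.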

The court ruled that the text and data mining exception of the DSM Directive is intended for reproductions that are necessary when compiling the data corpus for training, such as converting a work to another (digital) format or storing it in working memory. These reproductions serve later analysis only, do not affect the author's exploitation interests, and are therefore permitted. By contrast, the output of OpenAI's AI systems does affect the authors' exploitation interests and therefore does not fall under the text and data mining exception, according to the German court.

The German court's ruling is a big boost for copyright holders in Europe. In my view, the extensively reasoned judgment nicely illustrates that, even with the advent of AI systems, there is still room for copyright. Copyright holders are entitled to compensation if their copyrighted work is used to train AI systems. This means that copyright holders and AI providers need to sit down together to agree on compensation, and it shows the need for accessible collective licensing for the use of copyrighted content to train AI systems. A good example is GPT-NL, which offers such licences.

More information?

If, as a copyright owner, you want to know how to protect your exclusive rights in an AI context, TK's specialists are at your service. TK's specialists are also ready to help parties in the AI sector who want to know how to respect copyright when developing and training AI systems.
