Training GenAI: Infringement or Fair Use?

Discussing the implications of the unauthorised use of materials for training Generative AI models, we are pleased to bring you this guest post by Goutham Rajeev and Vedant Bharadwaj Singh. The authors are third-year students at the Hidayatullah National Law University, Raipur.

By Goutham Rajeev and Vedant Bharadwaj Singh

A recent public response by OpenAI’s CTO (Chief Technology Officer) on the data sources used to train the company’s new text-to-video AI engine, Sora, has reignited the discussion on the use of copyrighted material to train Generative Artificial Intelligence (GenAI).

Presently, GenAI is primarily trained through Machine Learning (ML), a process that allows machines to learn from data and past experiences and to identify patterns with minimal human intervention. In the case of GenAI models, ML relies on Text and Data Mining (TDM). TDM is essentially a process of deriving information from data by analysing patterns in it and learning from them in order to create new results. This method involves feeding the model a large amount of data, including copyrighted works. The program must also make copies of the information before analysing it. A more detailed understanding of TDM may be found here.

As the worlds of GenAI and copyright collide, the issue before us is whether the use of copyrighted works through TDM for the training of GenAI amounts to copyright infringement. For the purposes of this post, the authors will primarily explore the “substitution effect”, i.e., the ‘impact on the market’ leg of the USA’s four-factor test of fair use, as a defence against infringement. The focus is on the USA because many suits relating to GenAI, TDM and copyright infringement have been initiated there and are still pending.

Whether the Process of TDM Amounts to Copyright Infringement?

The use of TDM processes by corporations to train their software programs has been a contentious legal question. In the United States, the law was settled in the cases discussed below, which have become the standard in various other jurisdictions, such as the EU.

In Authors’ Guild v. HathiTrust, the Court was called upon to examine whether text mining and the compilation of works to make them accessible to persons with disabilities would constitute fair use. Deciding in favour of the defendants, the Court said that data mining in itself could not constitute infringement. It held that the use of text mining to create a searchable database in fact created a transformative work, since it changed the essential purpose of the original works, and that there was therefore no infringement.

Similarly, Authors’ Guild v. Google Inc. concerned the question of whether using copyrighted books to create a searchable database, and displaying small snippets of the uploaded books, constituted fair use despite Google being a for-profit organisation. In its judgement, the Court held that while there may be some loss of income for the publishers, Google had made only fair use of the material by providing a free indexing system for the public rather than any substantial reproduction of the works.

In the Google case, the Court also held that Google’s for-profit objectives do not by themselves warrant an adverse inference where material has been copied, as Google had not used the material in a “substitutive” manner that would negatively affect the authors and their market for books.

Hence, it can be said that while TDM does not prima facie amount to infringement, its usage should not have a detrimental impact on the market of the respective copyright holder.

Whether the Use of TDM for Machine Learning by a For-Profit GenAI Amounts to Copyright Infringement?

The test of substitution discussed above becomes crucial when determining whether the use of TDM to train GenAI amounts to copyright infringement. The “substitution” effect was given a more workable form in the Warhol judgement, which overhauled the existing application of the four-factor test. The court in Warhol identified that the primary question is not how the derivative work differs from the original work. Rather, it is the extent to which the derivative work is intended to substitute for the ways in which the owner of the original work would have exploited it to generate revenue, i.e., the impact on the potential market.

GenAI has the potential and the capacity to have a ‘substitutive’ effect on various markets. This effect is already visible in the use of AI in academic writing and in aiding research and writing in general. The effect that AI, and specifically GenAI, will have on various sectors by causing a loss of employment for millions is already the subject of heated discussion. In this context, even though the benefits of GenAI are innumerable, it must also be borne in mind that GenAI should not act as a replacement for human creativity. In fact, where GenAI models have been trained to generate entire pictures or videos from text prompts, the affected industries are already at risk of being completely upended by AI. For instance, consider the sale of the Portrait of Edmond de Belamy, or Jason Allen’s ‘Théâtre D’opéra Spatial’ winning the Colorado State Fair’s annual fine art competition. These instances are evidence of the impact GenAI is currently having on the art industry.

Research has pointed out that a massive displacement crisis will be seen across sectors. Even if that displacement is mitigated, it is argued that it must not come at the expense of the existing markets of copyright holders, as laid down in the fair use doctrine.

A Permanent Fix: Adjusting EU’s Opt-Out Mechanism to the Indian Scenario

Given the recently evolved test of substitution, the lawsuits against Midjourney, Stability AI, OpenAI and other GenAI developers seem likely to be decided in favour of the plaintiffs. Even if they are not, the impact of GenAI is undeniable and it is only expected to grow in magnitude. However, the clash between copyright and GenAI needs something more than judicial intervention. This is reflected in the Indian government’s stated intention to frame a new law on artificial intelligence. Owing to the market disruption caused by GenAI, the government is hopeful that the new law will be able to appreciate the creativity quotient, both in terms of intellectual property and in terms of financial and commercial implications.

In that regard, various jurisdictions, such as Japan and the EU, have made specific exceptions relating to data mining in their copyright legislation. In the context of commercial TDM, integrating an opt-out system similar to that of the EU Directive into the Indian framework is an option that policymakers should consider. Article 4 of the Copyright in the Digital Single Market (CDSM) Directive provides an exception that extends to commercial TDM activities. It covers reproductions and extractions of lawfully accessed works for the purposes of TDM, provided that such use has not been expressly reserved by the rightsholder in the prescribed manner. The objective of the article is to acknowledge that the use of data to train AI systems should not excessively restrict the rightsholder’s right of exploitation and reproduction.

This approach would enable content creators to exclude their material from GenAI training datasets. While aiming to create a fairer landscape, this model currently presents drawbacks with respect to power balance and transparency that may be detrimental to both GenAI owners and copyright holders.
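To make the mechanism concrete, the following is a minimal sketch of how a training pipeline might honour such reservations. It is purely illustrative: the `Work` record, its `tdm_reserved` flag and the `filter_opted_out` function are hypothetical names chosen for this example, and the sketch assumes that each work already carries a reliable, machine-readable record of whether its rightsholder has opted out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """Hypothetical record for one work in a candidate training set."""
    title: str
    rightsholder: str
    tdm_reserved: bool  # assumption: True means the rightsholder has opted out


def filter_opted_out(works):
    """Keep only the works whose rightsholders have not reserved TDM use."""
    return [w for w in works if not w.tdm_reserved]


# Illustrative data only; these works and authors are invented.
candidates = [
    Work("Novel A", "Author 1", tdm_reserved=True),
    Work("Essay B", "Author 2", tdm_reserved=False),
    Work("Photo C", "Author 3", tdm_reserved=False),
]

training_set = filter_opted_out(candidates)
print([w.title for w in training_set])
```

The filter itself is trivial. The difficulty lies in the assumption that an accurate reservation flag exists for every work.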

The opt-out provision as envisioned in the EU might give copyright holders too much power, possibly resulting in high fees for the use of GenAI training data and slowing down small AI startups. India could instead adopt a fairer model, based on data equity and algorithmic fairness, that prevents exploitation by either party. Techniques for attributing data could be used to calculate the amount of copyrighted data in an AI creation and distribute compensation to human authors annually. For instance, companies like Adobe and Stability AI compensate human authors who contribute their work to AI training.
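As a rough illustration of how attribution-based compensation could work, the sketch below splits an annual compensation pool among authors in proportion to an attribution score. The pro-rata rule, the `distribute_compensation` function and the figures used are assumptions made for this example only. The sketch does not describe how Adobe, Stability AI or any other company calculates payments, and it takes the attribution scores as given rather than showing how they would be measured.

```python
def distribute_compensation(pool, attribution_scores):
    """Split `pool` among authors in proportion to their attribution scores.

    `attribution_scores` maps each author to a non-negative number that is
    assumed to come from some data-attribution technique (not shown here).
    """
    total = sum(attribution_scores.values())
    if total <= 0:
        # Nothing to attribute, so nobody is paid from the pool.
        return {author: 0.0 for author in attribution_scores}
    return {
        author: pool * score / total
        for author, score in attribution_scores.items()
    }


# Illustrative figures only; authors and scores are invented.
scores = {"Author 1": 3.0, "Author 2": 1.0, "Author 3": 1.0}
payouts = distribute_compensation(1000.0, scores)

for author, amount in payouts.items():
    print(f"{author}: {amount:.2f}")
```

Under this rule, an author whose works account for three-fifths of the attributed data receives three-fifths of the pool. A real scheme would also have to settle how the scores are measured and audited.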

The EU’s opt-out system requires GenAI owners to list their training data, which is challenging due to factors like varied copyright ownership and incomplete metadata. If India adopts this system, it must take into account the compliance burden this places on GenAI owners. Lawmakers should clarify how much transparency is needed for TDM opt-outs. The mechanism must also overcome logistical issues and grant creators access to datasets.

Conclusion

As we navigate the AI revolution, understanding and responsibly guiding its development are imperative. While jurisprudence has evolved to address issues like TDM, applying these principles directly to GenAI isn’t advisable. It’s crucial to strike a balance between GenAI’s development and the protection of the creators whose content it learns from. Failing to do so risks undermining creative works. Implementing measures like a substitutive effect test and an opt-out mechanism could help mitigate these risks and create a legal framework that accommodates all stakeholders. This approach aims to counter any negative impacts GenAI may have on society.
