Discussing Lemley and Henderson’s “The Mirage of Artificial Intelligence Terms of Use Restrictions”


“There is a certain hypocrisy in arguing that training models on the public’s data is fair use but then seeking to prevent others from doing the same thing.”

-Lemley and Henderson

Generative AI companies have been in the news all year, for both good and not-so-good reasons. Various suits have been filed against these companies, arguing that the training process violates copyright protection (USA, Germany, Canada). No one knows what the outcome of these ongoing disputes will be. Amidst this, Prof. Mark Lemley and Peter Henderson recently came out with an interesting paper which has relevance not only for present lawsuits but also for possible future lawsuits relating to AI companies. The paper discusses the terms of use of various GenAI models and inquires, in the context of the US legal landscape, whether the output-based restrictions contained in those terms are enforceable. The answer to this question has implications for potential future lawsuits by GenAI companies against individuals or companies that develop AI models using their model outputs. In this post, I will focus on the part related to anti-competitive provisions in GenAI companies’ terms of use.

Terms of Use of Generative AI Companies 

It is a truism that most of us do not bother reading the terms and conditions for almost anything (and if you do, perhaps pick a hobby!). However, the terms of use of Generative AI companies make for an interesting read. They clarify the ownership of the output generated and how such output can (and cannot) be used. For anyone following the barrage of lawsuits filed against these companies, the prominent affirmative defense raised against the allegation of ‘direct copyright infringement’ has been (no prize for guessing the obvious) “Fair Use.” For instance, in Concord Music Group v. Anthropic, Anthropic (creator of Claude) has argued that using copyrighted song lyrics to train an AI model is ‘transformative’ use since the lyrics are not being used for the same end for which they were created; rather, they are broken into small tokens to derive statistical weights. Fair use is also being argued as a defense in Richard Kadrey v. Meta, Mike Huckabee v. Bloomberg, Thomson Reuters v. ROSS Intelligence, Paul Lehrman v. LOVO, and New York Times Company v. Microsoft Corporation.

So, GenAI companies argue that they can use copyrighted work, without permission, to train their AI models because doing so is covered by fair use. How about using the output generated by these models to train my own AI model? To answer this, we must ask ourselves a basic, preliminary question: is this output copyrightable in the first place? If it is not, surely I can use it to train my model without any restriction. And if it is, I can argue, just as the GenAI companies do, that my use is fair use. Either way, it seems I can train my model using “synthetic data.” Seems simple. Well, it isn’t.

Lemley and Henderson point out that most of these tools come with a caveat: you cannot use the proprietary models’ output to train new models. They tabulate (at pp. 14 and 15 of the paper) the terms of use of various GenAI companies which restrict users from using the output generated by their models to train a ‘competitive’ AI model.

Lemley and Henderson call these restrictions “Anti-competitive, Anti-scraping and Anti-Reverse Engineering” provisions. On the one hand, these companies argue that they can use copyrighted works of human authorship to train their models; on the other, they prohibit users from using the generated output to train a competing model. Are these restrictions legally enforceable? This is precisely the question the authors are interested in.

No copyright over the Model Output

The authors first focus on the question of the copyrightability of the output generated by an AI model. Relying on US Copyright Office guidelines, they argue that model output is not copyrightable since it lacks sufficient human authorship or creative input; the ultimate creative control remains with the model. Lemley, in one of his papers, argues that model output could be copyrighted if the user engages in prompt engineering that demonstrates the creative intervention of a human author. Even then, the model creator does not get copyright over the output. The authors therefore conclude that “any output generated purely as a function of the AI model is not copyrightable.”

Interestingly, the authors highlight that most GenAI companies, in their terms of use, assign ownership of the output to the user. Meta’s Llama 3 terms of use, for instance, do so expressly.

If the output is not copyrightable in the first place, how can a GenAI company assign any ownership over it to the user, let alone restrict its use?

The Contractual Claim Fails

The authors then examine whether these terms can be enforced as a matter of contract.

First, the authors argue that the anti-competitive provision, which prohibits users or third parties from using the model output to train a competing AI model, is unlikely to be enforced. Surveying US case law on the issue, they argue that restrictive conditions which refuse access to a service altogether for anti-competitive reasons (here, the condition of not developing a competing AI model using model output) are likely to be defeated by antitrust claims. Even apart from antitrust, courts can hold that such terms are unenforceable for violating public policy, in this case because they prohibit others from competing with existing models using their uncopyrightable output.

Even assuming that these outputs are copyrightable, the authors argue, a claim under the copyright misuse doctrine can successfully be made against such anti-competitive provisions for improperly expanding the copyright monopoly. In India, this doctrine has so far been discussed in only one case, Tekla Corporation vs Survo Ghosh, which held that the doctrine was inapplicable in India. (here)

The authors also discuss copyright pre-emption doctrine which, however, is not relevant in the Indian context. (refer here for more on this)

The authors also argue that these terms are, practically speaking, tough to enforce. AI companies do not restrict the sharing of model output, nor do they claim any ownership over it. Third parties who obtain the data from a user are therefore not bound by the restrictive conditions of the terms of use, since they have not entered into a contract with the model creator. Moreover, courts, the authors point out, do not favour restrictive covenants which control ‘downstream uses of things’ that a person or company does not own.

Extending Monopoly under Copyright Law

For some time, I have been fairly interested in the practice of copyright owners extending their monopoly rights over their works through contract law. As a result, they not only retain their rights under copyright law but also appropriate other entitlements over the work that copyright law does not provide. Sometimes, this attempt to extend their monopoly beyond what the statute guarantees encroaches upon fair use.

Although Lemley and Henderson argue that model output is highly likely to be uncopyrightable, I think GenAI terms and conditions also illustrate how the copyright monopoly is sought to be improperly expanded. For instance, the terms of use of certain models also prohibit a person from attempting to “reverse engineer, decompile or discover the source code or underlying components.” Under Section 52(1)(ab) of the Copyright Act, a user can do any act, including reverse engineering, to obtain the information necessary to achieve interoperability. Assuming the output is copyrightable, is this condition valid?

In Tekla, the DHC had said that if a copyright holder imposes restrictive conditions which prohibit someone from doing what is permitted u/s. 52 (fair use), the same would be illegal and unenforceable. Further, if such illegal conditions are imposed, a person could pursue independent legal action against the copyright owner to prevent such misuse of copyright.

In this paper, Matthan and Narendran argue that fair use, rather than being merely a defence to infringement, is also a user right under the Copyright Act. They argue that there is a utilitarian rationale underlying the copyright regime. (here) As a result, the right of fair dealing represents a careful balance between the author’s monopoly on the one hand and the dissemination of information and public access to knowledge on the other. Thus, they argue, any contractual or licensing covenant which restricts a user’s right of fair use will likely be held void under Section 23 of the Indian Contract Act for being contrary to public policy. Further, owners imposing such illegal covenants could also be prosecuted under Section 63 of the Copyright Act. Prof. Aparajita, in this post, also highlights that fair use operates as a ‘user right’ which balances the competing interests of authors and users; any contractual waiver of the same might defeat public policy. Akshat also argues that the statutory exemptions under the Copyright Act, emanating from Article 19 of the Constitution, cannot be waived through contractual terms.

As a result, although there is no misuse doctrine in India, there are strong grounds to oppose the enforceability of these licensing terms here as well. This has relevance beyond GenAI, as other entities, especially those dealing in computer programs, also engage in this practice. (read it here)

Relevance for Knowledge Distillation

To be accurate, an AI model needs to be trained on large amounts of data to learn statistical patterns (weights) and accordingly produce output for a given prompt. Building such a training corpus takes years of scraping and research, so creating an AI model, as the authors point out, is expensive as well as time-consuming. Companies such as Google, Microsoft and Meta already possess the large swathes of data and the resources required to venture into AI model-making. Further, with the recent lawsuits, it is possible that, moving forward, AI companies will have to pay hefty license fees before they can use copyrighted works to train their models. Recently, Taylor & Francis inked a license deal with Microsoft which allows the latter to use its works for training its AI models. As a result, entering the AI-model-making market becomes prohibitive for newer players.

The authors discuss ‘Knowledge Distillation’ in this context: using a dataset of a target model’s behaviours, researchers create smaller and more efficient models which compete with the original model’s ability to deal with specific tasks. Training newer models using synthetic data from existing models is also easier, safer and more efficient, making it possible to create customized models tailored to specific fields such as medicine or academics. This democratizes the so-far exclusive AI-model-making market currently occupied by tech giants. Recently, India too has witnessed indigenous LLMs being launched: OLA’s Krutrim, HanoomanAI and SarvamAI. However, the quality of these models has been questioned over the past year. (here, here) Apart from the inherent hypocrisy in preventing others from using synthetic data to develop competing AI models, there are good public policy reasons for not enforcing such anti-competitive covenants.
As the authors note: “allowing centralization of power in a few AI companies is an undesirable—and perhaps even dangerous—proposition.”
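For readers curious about what knowledge distillation actually involves, here is a minimal, illustrative sketch in Python. It is my own toy example, not from the paper: a “teacher” (standing in for a large proprietary model) is queried for its outputs, and a “student” is trained purely on those outputs, never seeing the teacher’s original training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher": a tiny fixed logistic model standing in for a
# large proprietary model whose outputs a user can query.
W_TEACHER = np.array([2.0, -1.0])

def teacher_probs(X):
    """The teacher's soft outputs (probabilities) for a batch of inputs."""
    return 1.0 / (1.0 + np.exp(-(X @ W_TEACHER)))

# Step 1: collect "synthetic data" -- inputs paired with the teacher's outputs.
X = rng.normal(size=(500, 2))
soft_labels = teacher_probs(X)

# Step 2: train a student on the teacher's soft labels by minimising
# cross-entropy with plain gradient descent.
w_student = np.zeros(2)
learning_rate = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w_student)))
    grad = X.T @ (p - soft_labels) / len(X)
    w_student -= learning_rate * grad

# The student now closely mimics the teacher's behaviour on these inputs,
# without ever having accessed the teacher's training data.
student_probs = 1.0 / (1.0 + np.exp(-(X @ w_student)))
max_gap = np.max(np.abs(student_probs - teacher_probs(X)))
print(f"max probability gap between student and teacher: {max_gap:.4f}")
```

The legal question the paper raises maps directly onto this sketch: the only thing the distiller ever touches is `soft_labels`, i.e. the model output whose use the terms of use purport to restrict.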
