[Part II] ANI v. Open AI – The Storage Paradox is More Than Just Transient!

Following the discussion on the non-applicability of the ‘derivative work’ theory in the Indian context in light of the ANI v. OpenAI case, in Part II of her post, Shama Mahajan argues that the fair use defence of incidental or transient storage will be weak against the infringement allegations, given how data processing and storage work in Gen-AI models. Shama is an LL.M. candidate at the National University of Singapore, pursuing her master’s in Intellectual Property and Technology Law.

A meme featuring a scene from a car with a man gesturing while speaking, alongside the text: 'RELYING TOO MUCH ON FAIR USE FOR STORAGE? IT'D BE A LOT COOLER IF YOU DIDN'T.'

ANI v. Open AI – The Storage Paradox is More Than Just Transient!

By Shama Mahajan

Continuing the discussion on the ANI v. OpenAI case, in this post I want to talk about the first issue, which concerns storage. For convenience, I am reproducing the issue below:

Whether the storage by the defendants, of plaintiff’s data (which is in the nature of news and is claimed to be protected under the Copyright Act, 1957) for training its software i.e., ChatGPT, would amount to infringement of plaintiff’s copyright.

The aim here is to analyse existing copyright law and its application to Gen-AI systems from the perspective of how the training datasets used in these models are stored, and to identify the points where law and technology intersect and demand a more nuanced application of legal principles.

Storage: What is taken? What is retained?

Before deciding on whether storage of the copyrighted material is infringement, the Court will have to do an analysis of ‘storage’ itself in the light of the functional dynamics of the Gen-AI systems like ChatGPT. This discussion would thus entail:

Whether ChatGPT stores any data on which copyright claim is exercised?
In what format/form is the data stored?
How long is the data being stored?

The reason why ‘storage’ gains so much significance is Section 14(a)(i) of the Indian Copyright Act, 1957, whose phraseology includes storage within the ambit of reproduction, so that the two collectively constitute an exclusive right of the copyright owner. In the traditional sense, the landscape of storage was more or less defined. However, as and when required, as with the advent of the internet and broadcasting, the boundaries of ‘storage’ have been modified in response to technological advancement.

With Gen-AI systems, it is understood that the building blocks are the ‘datasets’. These datasets are gathered from various sources for the purpose of training the Gen-AI systems. At the point of taking, the data is what we would recognise as material that may or may not be protected by copyright. However, does the Gen-AI system retain what it takes in the same manner/form in which it was taken? The answer would be no.

The first lesson when you start learning computer programming is that a ‘computer only understands the binary language of 0 and 1’. Similarly, Gen-AI systems require that all data, which may come in varied formats such as text, image, sound or numbers, be converted into numerical representations. As highlighted in this paper, even though a lot of data is scraped, including copyrighted material, the training itself does not use the data in the same form. Prior to training, the data is converted into numerical representation.
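To make the conversion described above concrete, here is a minimal sketch in Python. It is purely illustrative: the word-to-number scheme, the function names `build_vocab` and `encode`, and the sample sentence are all assumptions made for this example, and it does not reflect how ChatGPT or any real Gen-AI system actually tokenizes its training data.

```python
# Toy illustration only. Real Gen-AI systems use far more elaborate
# tokenizers; this word-level scheme is an assumption for demonstration.

def build_vocab(texts):
    """Assign each distinct lowercase word an integer ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a sentence into the list of integer IDs a training step would work with."""
    return [vocab[word] for word in text.lower().split()]


# Hypothetical sample sentence standing in for scraped content.
corpus = ["the court examined the storage issue"]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)
print(ids)  # [0, 1, 2, 0, 3, 4]
```

What the training step receives in this sketch is the list of numbers, not the sentence in its human-readable form.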

Therefore, the process of storage begins only after the data is converted from its original expressive form (understandable to humans) to numerical representation (understandable to machine learning systems). Hence, the answer to the question ‘is the copyrighted material stored by ChatGPT?’ is not an absolute yes or no, but rather a conditional yes.

Indian Courts, while examining what constitutes ‘reproduction’, have held that ‘it applies only to those reproductions that objectively replicate or copy the form of expression of the original work’. Since reproduction also includes storage in any medium, extending the same principle to the Gen-AI context would mean that if copyrighted expression is not replicated in the same form while it is being stored, such storage would not constitute reproduction. This argument is further strengthened by the fact that copyright protects the expression and not the idea. When the copyrighted expression is converted into a form that does not resemble it at all (the result being non-expressive material on which the training of Gen-AI models takes place), will copyright continue to protect it?

It is a Copy when you see the Copy!

We can take the aid of the Indian Copyright Act to answer the question above. The definition of ‘infringing copy’ u/s 2(m) of the Indian Copyright Act, read as a whole, indicates that infringement is directed towards the ‘expressive form’ of the copyrighted work in question. The Court has also held that reproduction must entail replication of the ‘expressive form’, and that the test is whether an average viewer would get an unmistakable impression that one work was a copy of the other. This indicates that, for infringement to occur, in addition to reproduction of the expressive form there must also be publication of it, whereby it can be accessed, seen and consumed by the public.

Gen-AI models store the datasets, after their conversion, in what can be called ‘machine-readable form’, and the decision about what is stored is taken by the model-trainers, who decide what data the models will be trained on. Thus, once the datasets are determined and the model begins training, there is no ‘publication of that dataset’ by the Gen-AI systems to any user/public.

The question that arises next is how ‘storage’ then fits into the above analogy of ‘public consumption’ or ‘publication’. Interestingly, the Indian courts have faced a similar storage-related issue in two instances: first in My Space Inc. and subsequently in Tips Industries Ltd. From the facts and the outcomes of these disputes, it is amply clear that ‘storage’, when read ejusdem generis with ‘reproduction’, refers to ‘storage that allows access to the stored copyrighted content’, resulting in infringement. Hence, storage which allows unauthorized access to copyrighted content, for example third-party platforms that store music and allow downloads to a personal device, will amount to storage that is infringing. In other words, storage that does not allow the public access to the stored content will, under the principles laid down by the Indian judiciary itself, not be infringing.

The aforementioned storage can also qualify as personal use, as held in a US case where the recording and storage of broadcast content on Video Tape Recorders by an individual for later use was held to be non-infringing.

A distinction can be drawn between ‘storing of copyrighted material in any medium’ and ‘storing of the copyrighted material in any form’. The emphasis has been on the medium of storage, i.e. hard drive, online servers, websites, third-party apps etc., which connotes that ‘the storage analysis is limited to the content’s medium of storage’, whereas the reproduction/infringing copy analysis encompasses the form of the content. Thus, when the interpretation of ‘reproduction’ and ‘infringing copy’ by the Indian courts is juxtaposed against ‘storage in any medium’, the outcome would be:

“If the material in question does not replicate the expressive form of the copyrighted work to the public (the ‘form analysis’), then its storage in any medium, will not constitute infringement (storage analysis)”

In such a scenario, the usage of copyrighted content and its storage for training Gen-AI models like ChatGPT would not constitute infringement. The argument that such conversion into machine-readable format is an ‘adaptation’ would not stand, given that the understanding of adaptation is rooted in the ability of the copyright holder to carry out an adaptation which can be accessed and consumed by the public. The conversion of copyrighted material into numerical representation by a machine does not satisfy this understanding and intent.

Storage: How much is too much?

The other question that arises in examining this issue is whether the storage of copyrighted content during training can be called incidental and/or transient. This becomes relevant from the ‘fair use’ perspective, which is to say that even if Courts hold that storage as it happens in the case of Gen-AI models is an infringement, it may yet be protected by fair use. This question has two possible forms of datasets to be considered:

The copyrighted expression as scraped, prior to being converted into machine-readable form.
The dataset after it is converted into machine-readable form.

In the first instance, the storage of the copyrighted material in its expressive form would qualify as transient and incidental storage. As per the decision of the Court in MySpace Inc., ‘transient storage’ refers to temporary storage, whereas ‘incidental storage’ means storage subordinate to something of greater significance. In the Gen-AI Supply Chain, the datasets consisting of copyrighted expression, when scraped from sources across the web, are stored temporarily in that form prior to conversion into machine-readable form, and this stage is incidental to the process of pre-training the Gen-AI models. The 227th Report of the Parliamentary Standing Committee on the Copyright (Amendment) Bill, 2010 also refers to this understanding while explaining the storage that occurs as part of ‘caching’, which is considered important for improving the core functionality.

The matter gets slightly tricky in the second instance. Once the datasets are converted into machine-readable form, the model is trained on the datasets chosen by the trainers. These models store these datasets for as long as anyone cares to keep a copy of the model. Thus, in this case the storage is indeed not transient.

However, under the Indian Copyright Act, the storage must be either transient OR incidental. The decision to replace ‘and’ with ‘or’ was taken to address the unlimited liability concerns raised by ISPs over third-party actions. Thus, even if not transient, can storage of the datasets in this case be considered incidental?

The Indian jurisprudence on incidental copying is limited to the two cases of MySpace and Tips Industries Ltd. Here, assistance may be drawn from the international jurisprudence of the EU and UK. The requirement of ‘incidental storage’ features in Article 5(1) of the Information Society Directive, as interpreted by the Court of Justice. This provision lays down 5 cumulative conditions:

(a) the copying ought to be temporary; (b) it ought to be transient or incidental; (c) it should be an integral and essential part of a technological process; (d) the sole purpose of such a process should be to enable either a transmission in a network between third parties by an intermediary or a lawful use; and (e) such an activity should have no independent economic significance.

Under this 5-factor analysis, ‘incidental’ copies are those which last longer than ‘transient’ ones, but still remain temporary and are deleted without any human intervention. Incidental copies are considered to have no significance from a copyright perspective, in the sense that they neither exist independently of, nor have a purpose independent of, the technological process of which they form part.

Thus, the Indian understanding of ‘incidental storage’ is similar to the understanding in the EU and UK, except for the requirement that the stored data be deleted automatically without human intervention. OpenAI would be in a position to justify that the datasets stored for training the Gen-AI systems are indeed not independent of the technological process involved, and are a necessary component of the overall process, as they enable its correct and efficient functioning. This is clearer from the functional understanding of how these models are trained. However, as highlighted earlier, the said data is not deleted. Even though the requirement under Indian law is either transient or incidental, applying the ejusdem generis principle would ultimately favour the understanding that even incidental storage can’t be permanent in nature.

The outcome would also vary if the data stored is more in the nature of meta-information or learning from the datasets rather than the dataset itself. In that case, if the dataset (the numerical representation of the original copyrighted expression) is destroyed automatically once the meta-information has been extracted from it, then the storage would qualify as incidental storage. The other consideration that might weigh on the court’s mind is whether the training dataset itself was an infringing copy, for example books from shadow libraries. The element of knowledge and control in choosing the datasets might influence how strictly the court handles this.

To Conclude…

The fair-use defence for training datasets is weak in light of the law as it stands today in India. However, whether storage for training itself would constitute infringement is, in my opinion, a question more likely to be answered in OpenAI’s favour if viewed from a Gen-AI Supply Chain perspective and the idea vis-à-vis expression angle.
