
This is Part II of the post. The previous post can be accessed here.
When I say copyright, it means just what I choose it to mean: nothing more nor less.
In a nutshell, during training, the LLM decomposes, abstracts, and constructs not text but representations of the relationships among the tokens generated from the text, as discussed in the previous post. Now, one obvious question that arises from the copyright infringement perspective is this: once the text is converted into tokens, each given a token ID and abstracted into numeric representations, namely vectors and word embeddings, is any ‘expression’, which copyright ostensibly seeks to protect, left in the work? There seems to be little doubt that ChatGPT has ‘used’ the copyrighted text. But does copyright reach every use of that text? For instance, any text embodies the following:
- the idea it conveys;
- the meaning of the words used (semantics);
- the grammatical arrangement of those words (syntax); and
- the author’s particular expression.

It is well established that copyright over a work does not give exclusive rights over the idea imbued in the text. Similarly, the meaning of the words used in the text (semantics) and the grammatical arrangement of those words (syntax) fall beyond the ambit of copyright protection. The only thing copyright protects is the ‘expression’ (I don’t think a source is needed for this). Now, when the LLM devours the text’s semantics, syntax, conceptual relationships and other underlying features, doesn’t it seem too far-fetched for any author to argue that she has ‘exclusive rights’ over them? Aren’t these elements, ideally speaking, the “non-expressive” parts of the work?
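To make the tokenisation point concrete, the sketch below shows, in deliberately toy Python, what ‘converting text into token IDs and embedding vectors’ amounts to. It is not any real model’s tokeniser or embedding table; the sentence, the vocabulary and the eight-dimensional vectors are invented purely for illustration.

```python
# A deliberately tiny, illustrative sketch (not any real model's tokenizer):
# it shows that once text is tokenised and embedded, what remains is a grid
# of numbers, not the author's wording.
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog"

# 1. Tokenisation: each distinct word gets an arbitrary integer ID.
#    (Real LLMs use subword tokenisers with vocabularies of tens of thousands
#    of IDs, but the principle is the same.)
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[word] for word in sentence.split()]
print(token_ids)      # [7, 6, 0, 2, 3, 5, 7, 4, 1] -- the wording is already gone

# 2. Embedding: each ID is looked up in a matrix of numbers. Here the matrix
#    is random; in a trained model it encodes statistical relationships
#    between tokens, not anyone's particular expression.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 8))  # toy 8-dimensional embeddings
vectors = embedding_matrix[token_ids]
print(vectors.shape)  # (9, 8): nine tokens, each now just eight floating-point numbers
```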
A language model merely compresses and cross-references linguistic information to identify predictable patterns and reduce redundancies, representing meaning probabilistically. During pre-training, the ‘identity’ and ‘wholeness’ of copyrighted works are lost; they are stripped of everything but their raw linguistic essence, functionality and utility. The compression captures only relational meaning in mathematical form. The aspects of the copyrighted work ‘used’ by the model during pre-training are the mathematical representations of word relationships. Thus, pre-training ‘transcends’ the limits of copyright, as it abstracts text into multi-dimensional numeric representations and patterns. Copyright can protect only the original expression, not the statistical relationships between words, irrespective of whether the source is copyright-protected. In an article published way back in 2019 in the Journal of the Copyright Society, Prof. Matthew Sag, giving examples of non-expressive use, cited the use of software to identify patterns of speech, relationships, or the frequency of particular words (pp. 301-302).
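To give a sense of what ‘representing meaning probabilistically’ and Prof. Sag’s ‘frequency of particular words’ can look like in practice, here is another crude, purely illustrative sketch: a simple bigram count over an invented three-sentence ‘corpus’. It is nowhere near how a transformer is actually trained, but it shows the kind of artefact such analysis produces, a table of word-to-word frequencies in which the source sentences themselves no longer appear.

```python
# A toy illustration of a 'statistical relationship between words': bigram
# counts and the conditional probability of the next word. Far cruder than a
# language model, but the surviving information is of the same kind --
# frequencies and co-occurrence patterns, not the original sentences.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the dog sat on the rug",
]

follow_counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1

# Estimated probability of each word following "the", from raw counts.
total = sum(follow_counts["the"].values())
for word, count in follow_counts["the"].most_common():
    print(f"P({word!r} | 'the') = {count / total:.2f}")
# Output: cat and rug at 0.33 each, mat and dog at 0.17 each. The three
# sentences are nowhere in this table; only word-to-word frequencies survive.
```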
The 2nd Circuit Court in the US came to a similar-ish finding, though from a different perspective, in Authors Guild, Inc. v. HathiTrust (2014). The Court held that the copyright owner cannot assert copyright against a text-searchable database holding the copyrighted work without authorisation, as “the result of word search is different in purpose, character, expression, meaning and message from the page (and the book) from which it is drawn.” Pertinently, the 2nd Circuit gave its finding of fair use in the case, holding the use to be transformative, without examining whether there was copyright infringement in the first place. Fair use is invoked when the purpose of the use is changed; my point, however, is that when the ‘expression’ and ‘meaning’ of the copyrighted work themselves are changed, such use ‘transcends’ the ambit of the right called copyright.
Certain jurisdictions, such as Singapore, have a Computational Data Analysis (CDA) exception (s. 243) allowing the identification, extraction and analysis of information in order to improve the functioning of a computer program. For this particular ‘use’, the statute even allows making a copy of the work in question (s. 244). India has no Text and Data Mining (TDM) or CDA exception. However, the use of the non-expressive elements of a copyrighted work is already accounted for within the very concept of copyright, and there is no per se need for a statutory exception. Prof. Tim Dornis echoes the view that TDM or CDA does not require an exception in his recent paper. He adds that the issue of infringement crops up only because non-protectable, non-expressive information is embedded in a copyright ‘container’ or ‘shell’ (p. 7). He further says that the underlying aim of such exceptions is to legalise the copies and reproductions that precede TDM. However, Prof. Dornis has a more fundamental beef with the proposition being canvassed here (p. 11). He argues that since AI does not differentiate between semantic (non-expressive) and syntactic (apparently expressive) information during training, it infringes copyright. Yet, “somewhat surprisingly” (to quote Mikolov et al.), nowhere in his paper does he explain the basis for treating syntax as subject matter of copyright. It is inconceivable that anyone could monopolize the grammatical structure and arrangement of words in a language.
Cautions and Disclaimer
There will be no conclusion to this post. The final word on LLMs is yet to be spoken. However, it is important to put out a few words of caution. The word AI is an attention grabber (that is why it was used in the title of the post) but has little substance to offer (that is why it has not been used in the body of the post). Instead of making broad-stroke arguments about AI, it would be more instructive for legal academicians to deal with specifics. For example, what I say here is very specific to text and language and would certainly not be applicable to Midjourney, Stable Diffusion and DALL-E. Judges, lawyers and policymakers will have to appreciate the nitty-gritty and not generalize. Justice cannot be based on assumptions. Generalizations create prejudice, not fairness. The discussion on LLMs and copyright cannot and should not be resolved with a superficial understanding of AI. Saying it is a black box is just not enough.
Also, a point worth reiterating is that the discussion has been limited to the pre-training stage. It does not delve into fine-tuning, RLHF, storage and, most importantly, the output stage. Subsequent posts ‘might’ follow, dealing with these topics and building upon the ideas discussed here.