
Five Aspects of Generative AI for Analyzing Copyright Infringement and Fair Use
GenAI has five aspects relevant to copyright infringement and fair use: training input, short-term storage, long-term storage, repurposing output, and non-repurposing output
INTRODUCTION
Generative artificial intelligence (GenAI) is a technology that learns from large datasets of existing works, identifies relationships within those works, and uses that information to answer queries or produce new content. This article explores copyright concerns related to GenAI by examining five aspects for legal analysis: training input, storage (short-term and long-term), and output (repurposing and non-repurposing).
WHAT CONSTITUTES FAIR USE?
Readers of this article undoubtedly are familiar with copyright law. Fair use is a legal doctrine that permits the unlicensed use of copyright-protected works if the use meets certain criteria. Section 107 of the Copyright Act outlines four factors used to assess fair use:
1. Purpose and character of the use: A nonprofit or educational use is more likely to be fair use than a commercial one.
2. Nature of the copyrighted work: Use of a less creative, more factual work is more likely to be fair use.
3. Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Using a small, unimportant part of the original work is more likely to be fair use.
4. Effect of the use upon the potential market for or value of the copyrighted work: A use that does not compete in the marketplace with the original work is more likely to be fair use.
FIVE ASPECTS OF GENAI
GenAI has five aspects relevant to copyright infringement and fair use: training input, short-term storage, long-term storage, repurposing output, and non-repurposing output.[1]
ASPECT 1: TRAINING INPUT
This article does not consider illegally obtained data, such as nonpublic data obtained without a license. Accessing publicly available data does not constitute copyright infringement, as copyright law aims to encourage learning. Reading copyrighted material does not violate any of the exclusive rights that Section 106 of the Copyright Act grants to copyright holders, such as reproduction and distribution.
ASPECT 2: SHORT-TERM STORAGE
Copyright infringement cases involving technology have distinguished between transitory and non-transitory storage.
Under the Copyright Act, a work is "fixed" only if it is embodied in a copy for a period of more than transitory duration, so a transitory copy is not considered a copy for purposes of copyright infringement. The Second Circuit established in Cartoon Network, LP v. CSC Holdings, Inc. (Cablevision) that data held in a memory buffer for no more than 1.2 seconds is transitory and not copyright infringement. Thus, GenAI systems that keep information in memory only long enough to process it do not infringe copyrights.
ASPECT 3: LONG-TERM STORAGE
GenAI systems that maintain their training data in databases create fixed copies and thus infringe copyrights unless the copying qualifies as fair use. Does it?
Three cases are informative: Authors Guild, Inc. v. HathiTrust and Authors Guild, Inc. v. Google, Inc., both before the Second Circuit, and A.V. ex rel. Vanderhye v. iParadigms, LLC, before the Fourth Circuit. HathiTrust and Google both scanned books into a database to allow users to search for references within those books. iParadigms archived student research papers to provide a plagiarism detection service.
1. Purpose and Character of the Use
The Second Circuit determined that both HathiTrust's and Google's copying was fair use according to the first factor because the copying served a new purpose, full-text search, rather than substituting for the books themselves. The Fourth Circuit similarly decided that iParadigms' copying was fair use according to the first factor because it served a new purpose, plagiarism detection. For the same reason, the long-term storage of a GenAI system meets the criteria of fair use according to the first factor.
2. The Nature of the Copyrighted Work
The Second Circuit gave the second factor little weight in HathiTrust and Google because the copying was transformative: it turned the original works into something new that does not compete with them. The Fourth Circuit likewise found iParadigms' use to be transformative and unrelated to the creative expression in the works. In none of these systems could the user obtain any entire work from the database. For these same reasons, the second factor does not weigh against fair use for the long-term storage of GenAI systems. We will consider the outputs of the GenAI system later in this article.
3. Amount and Substantiality of the Portion Copied
The Second Circuit determined that both HathiTrust and Google needed to copy the books in their entirety because the systems would not work otherwise. The Fourth Circuit similarly found that the amount copied by iParadigms did not weigh against fair use. Copying the works in their entirety was justified in those cases, just as it is for GenAI systems.
4. Effect Upon the Potential Market for or Value of the Copyrighted Work
The courts determined that the output of the HathiTrust, Google, and iParadigms systems did not reduce the potential market for the original works because the systems created something different from the original works. Because the storage aspect of a GenAI system has no output, factor four weighs in favor of fair use for GenAI systems. We will consider outputs of the GenAI system later in this article.
ASPECT 4: REPURPOSING OUTPUT
Repurposing GenAI systems use one type of training data to generate outputs of a different type. For example, a security system may analyze facial images to send notifications about individuals detected by cameras, or another system may train on DNA data to predict traits and disease risks. Because the outputs of repurposing GenAI systems differ in form from their training data, they do not infringe copyrights.
ASPECT 5: NON-REPURPOSING OUTPUT
Non-repurposing GenAI systems produce outputs of the same type as their training data. For example, systems trained on documents generate new documents; systems trained on visual art generate artwork; and systems trained on music generate songs. To establish copyright infringement, the output must be shown to be substantially similar to one or more copyrighted works in the training set. Assessing copyright infringement requires a subjective evaluation of whether the output copies protected expression from the training data.
If it is determined that the output does infringe copyrights, is it fair use according to the four factors?
1. Purpose and Character of the Use
A non-repurposing GenAI system produces output that serves the same purpose as its training input, so an infringing output does not qualify as fair use according to the first factor.
2. The Nature of the Copyrighted Work
Output that copies from factual works, such as technical articles or news, is more likely to be considered fair use than output that copies from creative works, such as novels, movies, or songs.
3. Amount and Substantiality of the Portion Copied
How this factor weighs depends on the specific outputs of a specific non-repurposing GenAI system and on how much of the copyrighted training input appears in those outputs.
4. Effect Upon the Potential Market for or Value of the Copyrighted Work
The effect on the market is difficult to determine in the abstract. How much would a novel in the manner of Hemingway or a painting in the manner of Picasso reduce the market for original works by Hemingway or Picasso?
If the output of a non-repurposing GenAI system is found to infringe copyrights, whether that infringement is allowed as fair use would depend on the specifics of the output.
CONCLUSION
In this article, we have considered five aspects of GenAI systems with regard to copyright infringement and fair use. Based on court precedents, only the output of any particular non-repurposing GenAI system needs to be considered. If it is determined that the outputs infringe copyrights, then the four factors of fair use must be examined for the specific outputs to determine whether the copying is allowed.
Disclaimer – The views expressed in this article are the personal views of the author and are purely informative in nature.
1. In previous articles and papers, I referred to seven aspects, but two of those aspects were further broken into two aspects each. I have come to believe that the simplification of five aspects is more appropriate. This is just a nomenclature change, not a substance change.