Copyright Protection and Generative Models - Part Two

In this second part, we look at the mechanisms for copyright protection for generative models.

Like the first part of this blog, this blog is also based on the paper “Provable Copyright Protection for Generative Models”

To recap from the first blog:

The question of Copyright protection is important for generative models
Generative models hold much promise for novel content creation(code, text, images, video)
However, these models are trained on vast quantities of data, much of which is copyrighted.
The question of the legal permissibility of using the sampled outputs of such models remains unsolved.
If you use a generated code in your program or a generated art in your design, how can you be sure it is not substantially similar to some copyrighted work from the training set, and with all the legal and ethical implications this entails?

The paper aims to provide a formalism that enables rigorous guarantees on the similarity (and, more importantly, guarantees on the lack of similarity) between the output of a generative model and any potentially copyrighted data in its training set.

But how to implement this?

In the paper, the authors propose the idea of ‘near access freeness.’

Near access freeness is based on the idea that for copyright infringement to occur, we must prove that the defendant had access to the plaintiff’s copyrighted work. Also, the plaintiff needs to prove there are “substantial similarities between the defendant’s work and original elements of the plaintiff’s work.”

In terms of generative models, we could say that to determine access:

Formally defining “substantial similarity” is also not easy. Simple measures such as Hamming distance or verbatim copying are not enough. Instead, the paper uses the idea that generative models are inherently probabilistic. Hence we can use distance measures between distributions that are information-theoretic and agnostic to superficial issues such as pixel-based representations.

Based on these ideas, they formalise NAF as:

According to the authors
This definition reduces the task of determining a copyright infringement to (1) a quantitative question of the acceptable value of , and (2) a qualitative question of providing a function that appropriately satisfies a no access condition. Both can be application-dependent: the number of bits that constitute copyrightable content differs between, e.g., poems and images, and the function could also differ based on application.

References and image source
https://windowsontheory.org/2023/02/21/provable-copyright-protection-for-generative-models/

https://arxiv.org/abs/2302.10870

Copyright Protection and Generative Models – Part Two