On Provable Copyright Protection for Generative Models
Authors: Nikhil Vyas, Sham M. Kakade, Boaz Barak
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also show promising experiments on language and image generative models, demonstrating that our modified model does not degrade significantly in quality (and in fact, it may even improve in some cases). See Figure 1 for one example and Section 4 for more details. ... Section 4 provides a brief experimental validation |
| Researcher Affiliation | Academia | Nikhil Vyas 1 Sham Kakade 2 Boaz Barak 1 *Equal contribution 1Harvard School of Engineering and Applied Sciences 2Harvard School of Engineering and Applied Sciences and Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Nikhil Vyas <nikhil@g.harvard.edu>, Sham Kakade <sham@seas.harvard.edu>, Boaz Barak <b@boazbarak.org>. |
| Pseudocode | Yes | "Algorithm 1 leave-one-out-safe", "Algorithm 2 sharded-safe", "Algorithm 3 CP", "Algorithm 4 CP-k", "Algorithm 5 smooth-CP-k" |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code of the methodology it introduces (CP, CP-k, smooth-CP-k). It mentions using existing open-source models, e.g. 'Mosaic Large Language Models. https://github.com/mosaicml/examples/tree/main/llm, 2022.' |
| Open Datasets | Yes | We train U-net based diffusion models... on the full CIFAR-10 dataset... The dataset we use is CIFAR-10... We use the C4 dataset (Raffel et al., 2019)... |
| Dataset Splits | Yes | "Our algorithm starts by splitting this dataset into two disjoint datasets, making sure that copyrighted images are split into two different shards; for illustrative purposes, we do not deduplicate the dataset. The procedure then trains two models q1, q2 on these disjoint shards." (from the Figure 1 description). Also "sharded-safe: Partition D into D1 and D2." (Algorithm 2); for the m > 1 case, D is partitioned into D1, ..., Dm+1 datasets (Algorithm 6). See the sharding sketch after the table. |
| Hardware Specification | No | The paper does not specify the exact hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions 'U-net based diffusion models (specifically based on Yi-Lun Wu (2021))' and 'decoder-only transformers similar to GPT models (specifically (Mosaic ML, 2022))' but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | "We now present experiments with CP-k using a threshold of k = 500 to obtain the model p_k." ... "we use the same values of noise in the diffusion process (while training) for all the models (ensured by using the same random seed in training q1, q2 and p)". See the threshold sketch after the table. |
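
The Dataset Splits row quotes the sharded-safe step: partition the training data into disjoint shards so that each copyrighted work ends up entirely in a single shard, leaving the model trained on the other shard as one that never saw that work. The sketch below illustrates such a split; the `is_copyrighted` and `work_id` helpers, the random 50/50 assignment, and all names are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

def sharded_split(dataset, is_copyrighted, work_id, seed=0):
    """Partition `dataset` into two disjoint shards D1, D2 so that every
    copyrighted work lands entirely in one shard (sketch of the
    sharded-safe idea; the helper predicates are illustrative)."""
    rng = random.Random(seed)
    d1, d2 = [], []

    # Group copyrighted items by the work they come from, so duplicates of
    # the same work are never spread across both shards.
    works = defaultdict(list)
    other = []
    for item in dataset:
        if is_copyrighted(item):
            works[work_id(item)].append(item)
        else:
            other.append(item)

    # Assign each copyrighted work, as a unit, to exactly one shard.
    for items in works.values():
        (d1 if rng.random() < 0.5 else d2).extend(items)

    # Non-copyrighted data can go to either shard.
    for item in other:
        (d1 if rng.random() < 0.5 else d2).append(item)

    return d1, d2

# Models q1 and q2 would then be trained separately on d1 and d2.
```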
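
The Experiment Setup row refers to CP-k with a threshold of k = 500. The snippet below is only a schematic reading of such a log-ratio threshold: a sample from the model trained on all the data is kept only when its likelihood does not exceed every shard model's likelihood by more than k bits, otherwise it is redrawn. This acceptance rule, the function names, and the rejection-sampling wrapper are assumptions, not the paper's Algorithm 4.

```python
def within_threshold(logp_full, logp_shards, k=500.0):
    """Check whether the full model assigns at most k extra bits of
    log-likelihood over each shard model (assumed reading of the k = 500
    threshold; log-probabilities are taken base 2)."""
    return all(logp_full - lq <= k for lq in logp_shards)


def sample_with_threshold(sample_fn, score_fn, k=500.0, max_tries=100):
    """Hypothetical rejection-sampling wrapper: draw candidates from the
    full model until one passes the threshold test, or give up."""
    for _ in range(max_tries):
        y = sample_fn()                       # candidate from the full model p
        logp_full, logp_shards = score_fn(y)  # log2-likelihoods under p and q1, q2
        if within_threshold(logp_full, logp_shards, k):
            return y
    return None  # no acceptable sample within the budget
```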