On Provable Copyright Protection for Generative Models
Authors: Nikhil Vyas, Sham M. Kakade, Boaz Barak
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also show promising experiments on language and image generative models, demonstrating that our modified model does not degrade significantly in quality (and in fact, it may even improve in some cases). See Figure 1 for one example and Section 4 for more details. ... Section 4 provides a brief experimental validation |
| Researcher Affiliation | Academia | Nikhil Vyas 1 Sham Kakade 2 Boaz Barak 1 *Equal contribution 1Harvard School of Engineering and Applied Sciences 2Harvard School of Engineering and Applied Sciences and Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Nikhil Vyas <nikhil@g.harvard.edu>, Sham Kakade <sham@seas.harvard.edu>, Boaz Barak <b@boazbarak.org>. |
| Pseudocode | Yes | "Algorithm 1 leave-one-out-safe", "Algorithm 2 sharded-safe", "Algorithm 3 CP", "Algorithm 4 CP-k", "Algorithm 5 smooth-CP-k" |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code of the methodology it introduces (CP, CP-k, smooth-CP-k). It mentions using existing open-source models, e.g. 'Mosaic Large Language Models. https://github.com/mosaicml/examples/tree/main/llm, 2022.' |
| Open Datasets | Yes | We train U-net based diffusion models... on the full CIFAR-10 dataset... The dataset we use is CIFAR-10... We use the C4 dataset (Raffel et al., 2019)... |
| Dataset Splits | Yes | "Our algorithm starts by splitting this dataset into two disjoint datasets, making sure that copyrighted images are split into two different shards; for illustrative purposes, we do not deduplicate the dataset. The procedure then trains two models q1, q2 on these disjoint shards." (from the Figure 1 description). Also "sharded-safe: Partition D into D1 and D2." (Algorithm 2); for the m > 1 case, D is partitioned into D1, ..., Dm+1 datasets (Algorithm 6). See the sharding sketch after the table. |
| Hardware Specification | No | The paper does not specify the exact hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions 'U-net based diffusion models (specifically based on Yi-Lun Wu (2021))' and 'decoder-only transformers similar to GPT models (specifically (Mosaic ML, 2022))' but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | "We now present experiments with CP-k using a threshold of k = 500 to obtain the model p_k." ... "we use the same values of noise in the diffusion process (while training) for all the models (ensured by using the same random seed in training q1, q2 and p)". See the threshold sketch after the table. |
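
The Dataset Splits row quotes the sharded-safe step: partition the training data into disjoint shards so that each copyrighted work ends up entirely in a single shard, leaving the model trained on the other shard as one that never saw that work. The sketch below illustrates such a split; the `is_copyrighted` and `work_id` helpers, the random 50/50 assignment, and all names are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

def sharded_split(dataset, is_copyrighted, work_id, seed=0):
    """Partition `dataset` into two disjoint shards D1, D2 so that every
    copyrighted work lands entirely in one shard (sketch of the
    sharded-safe idea; the helper predicates are illustrative)."""
    rng = random.Random(seed)
    d1, d2 = [], []

    # Group copyrighted items by the work they come from, so duplicates of
    # the same work are never spread across both shards.
    works = defaultdict(list)
    other = []
    for item in dataset:
        if is_copyrighted(item):
            works[work_id(item)].append(item)
        else:
            other.append(item)

    # Assign each copyrighted work, as a unit, to exactly one shard.
    for items in works.values():
        (d1 if rng.random() < 0.5 else d2).extend(items)

    # Non-copyrighted data can go to either shard.
    for item in other:
        (d1 if rng.random() < 0.5 else d2).append(item)

    return d1, d2

# Models q1 and q2 would then be trained separately on d1 and d2.
```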
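
The Experiment Setup row refers to CP-k with a threshold of k = 500. The snippet below is only a schematic reading of such a log-ratio threshold: a sample from the model trained on all the data is kept only when its likelihood does not exceed every shard model's likelihood by more than k bits, otherwise it is redrawn. This acceptance rule, the function names, and the rejection-sampling wrapper are assumptions, not the paper's Algorithm 4.

```python
def within_threshold(logp_full, logp_shards, k=500.0):
    """Check whether the full model assigns at most k extra bits of
    log-likelihood over each shard model (assumed reading of the k = 500
    threshold; log-probabilities are taken base 2)."""
    return all(logp_full - lq <= k for lq in logp_shards)


def sample_with_threshold(sample_fn, score_fn, k=500.0, max_tries=100):
    """Hypothetical rejection-sampling wrapper: draw candidates from the
    full model until one passes the threshold test, or give up."""
    for _ in range(max_tries):
        y = sample_fn()                       # candidate from the full model p
        logp_full, logp_shards = score_fn(y)  # log2-likelihoods under p and q1, q2
        if within_threshold(logp_full, logp_shards, k):
            return y
    return None  # no acceptable sample within the budget
```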