Understanding and Mitigating Copying in Diffusion Models

Authors: Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set. (A caption-randomization sketch appears after this table.)
Researcher Affiliation | Academia | Gowthami Somepalli¹, Vasu Singla¹, Micah Goldblum², Jonas Geiping¹, Tom Goldstein¹. ¹University of Maryland, College Park ({gowthami, vsingla, jgeiping, tomg}@cs.umd.edu); ²New York University (goldblum@nyu.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/somepago/DCR.
Open Datasets | Yes | We use Imagenette, which consists of 10 classes from Imagenet [Deng et al., 2009], as well as two randomly sampled subsets of 10,000 and 100,000 images from LAION-2B [Schuhmann et al., 2022] for our experiments.
Dataset Splits | No | The paper does not provide explicit train/validation dataset splits (e.g., percentages or image counts) in the text.
Hardware Specification | Yes | We used one RTX-A6000 per model and it took about 24 hours to train. For inference, we used an RTX-A5000 and it took approximately 5 hours to create enough generations to compute our metrics.
Software Dependencies | No | The paper mentions several software components like BLIP, CLIP, SSCD, U-Net, and the Adam optimizer, but does not provide specific version numbers for them.
Experiment Setup | Yes | Unless otherwise noted, only the U-Net [Ronneberger et al., 2015] part of the pipeline is finetuned (the text and auto-encoder/decoder components are frozen) as in the original training run, and we finetune for 100,000 iterations with a constant learning rate of 5e-6 and 10,000 steps of warmup. All models are trained with batch size 16 and image resolution 256. (See the configuration sketch below.)
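
The finetuning recipe in the Experiment Setup row maps onto a fairly standard diffusers/PyTorch setup. Below is a minimal sketch of the optimizer and learning-rate schedule under those hyperparameters; the checkpoint name, the choice of plain Adam, and the linear-warmup implementation are assumptions for illustration, not the authors' exact training script.

```python
import torch
from diffusers import UNet2DConditionModel

# Load only the U-Net; the checkpoint name is an illustrative assumption.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
unet.train()  # the text encoder and VAE are kept frozen and are not loaded here

# Constant learning rate of 5e-6 after 10,000 warmup steps, 100,000 iterations total.
optimizer = torch.optim.Adam(unet.parameters(), lr=5e-6)
warmup_steps, total_steps = 10_000, 100_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

batch_size, resolution = 16, 256  # as reported in the paper
```

Calling scheduler.step() after each optimizer.step() ramps the rate linearly to 5e-6 over the first 10,000 iterations and then holds it constant for the remainder of training.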
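
The mitigation summarized under Research Type relies on randomizing and augmenting image captions during training. A minimal sketch of one such caption perturbation (random word insertion) follows; the specific strategy, word list, and insertion probability are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def randomize_caption(caption: str, filler_words: list[str], p_insert: float = 0.1) -> str:
    """After each word, insert a random filler word with probability p_insert."""
    augmented = []
    for word in caption.split():
        augmented.append(word)
        if random.random() < p_insert:
            augmented.append(random.choice(filler_words))
    return " ".join(augmented)

# Applied on the fly before tokenization, each epoch sees a slightly different
# conditioning signal for the same image, weakening caption-driven memorization.
fillers = ["photo", "picture", "image", "render", "shot"]  # illustrative word list
print(randomize_caption("a golden retriever playing in the snow", fillers))
```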