Understanding and Mitigating Copying in Diffusion Models
Authors: Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set. (See the caption-randomization sketch below the table.) |
| Researcher Affiliation | Academia | Gowthami Somepalli¹, Vasu Singla¹, Micah Goldblum², Jonas Geiping¹, Tom Goldstein¹; ¹University of Maryland, College Park {gowthami, vsingla, jgeiping, tomg}@cs.umd.edu; ²New York University goldblum@nyu.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/somepago/DCR. |
| Open Datasets | Yes | We use Imagenette, which consists of 10 classes from Imagenet [Deng et al., 2009] as well as two randomly sampled subsets of 10,000 and 100,000 images from LAION-2B [Schuhmann et al., 2022] for our experiments. |
| Dataset Splits | No | The paper does not provide explicit validation dataset splits (e.g., percentages or counts) in the text. |
| Hardware Specification | Yes | We used one RTX-A6000 per model and it took about 24 hours to train. For inference, we used RTX-A5000 and it took approximately 5 hours to create enough generations to compute our metrics. |
| Software Dependencies | No | The paper mentions several software components like BLIP, CLIP, SSCD, U-Net, and Adam optimizer, but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Unless otherwise noted, only the U-Net [Ronneberger et al., 2015] part of the pipeline is finetuned (the text and auto-encoder/decoder components are frozen) as in the original training run, and we finetune for 100,000 iterations with a constant learning rate of 5e-6 and 10,000 steps of warmup. All models are trained with batch size 16 and image resolution 256. (See the configuration sketch below the table.) |
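
As a rough illustration of the mitigation quoted in the Research Type row, the sketch below applies caption randomization at training time. It is a minimal sketch in the spirit of the paper's random-caption-replacement and random-token-addition variants, not the authors' implementation: the `GENERIC_CAPTIONS` pool, the probabilities, and the `randomize_caption` helper name are illustrative assumptions.

```python
import random

# Hedged sketch of caption randomization at training time, in the spirit of the
# paper's random-caption-replacement / random-token-addition mitigations.
# GENERIC_CAPTIONS, the probabilities, and the helper name are illustrative
# assumptions, not the authors' exact settings.
GENERIC_CAPTIONS = ["an image", "a photo", "a picture"]

def randomize_caption(caption: str, p_replace: float = 0.1, p_token: float = 0.1) -> str:
    """Return a possibly perturbed caption for one training example."""
    if random.random() < p_replace:
        # Replace the whole caption with a generic one.
        return random.choice(GENERIC_CAPTIONS)
    if random.random() < p_token:
        # Prepend a random token to weaken the caption-image association.
        return f"tok{random.randint(0, 9999)} {caption}"
    return caption
```

A training loop would call `randomize_caption` on each caption before tokenization, so the model rarely sees the exact same (image, caption) pair across epochs.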
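
The Experiment Setup row can also be read as the training configuration below. This is a hedged sketch assuming that `unet`, `text_encoder`, and `vae` are pre-loaded Stable Diffusion components; the helper names `build_optimizer` and `build_scheduler` are illustrative, and plain `torch.optim.Adam` with a linear-warmup-then-constant schedule is an assumption consistent with, but not spelled out in, the quoted setup.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row.
config = dict(
    max_train_steps=100_000,   # fine-tuning iterations
    warmup_steps=10_000,       # learning-rate warmup steps
    learning_rate=5e-6,        # constant LR after warmup
    train_batch_size=16,
    resolution=256,
)

def build_optimizer(unet, text_encoder, vae):
    # Only the U-Net is fine-tuned; the text encoder and VAE stay frozen,
    # as in the original training run described in the paper.
    for module in (text_encoder, vae):
        for p in module.parameters():
            p.requires_grad_(False)
    return torch.optim.Adam(unet.parameters(), lr=config["learning_rate"])

def build_scheduler(optimizer):
    # Linear warmup for `warmup_steps`, then a constant learning rate.
    def lr_lambda(step):
        return min(1.0, step / config["warmup_steps"])
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

The scheduler is stepped once per optimizer step, so the learning rate ramps linearly to 5e-6 over the first 10,000 iterations and stays constant thereafter.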