Slight Corruption in Pre-training Data Makes Better Diffusion Models

Authors: Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This paper presents the first comprehensive study on the impact of condition corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages.
Researcher Affiliation Academia 1 Carnegie Mellon University, 2 The University of Hong Kong, 3 Mila Quebec AI Institute, 4 RIKEN AIP, 5 The University of Tokyo, 6 William & Mary, 7 MBZUAI
Pseudocode No The paper does not include a clearly labeled "Pseudocode" or "Algorithm" block. Equations and descriptions of methods are provided, but not in pseudocode format.
Open Source Code Yes All models are released at https://huggingface.co/DiffusionNoise.
Open Datasets Yes We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. More specifically, we train class-conditional and text-conditional LDM-4 from scratch on synthetically corrupted IN-1K [50] and CC3M [14].
Dataset Splits Yes We use IN-1K class labels for class-conditional LDMs and MS-COCO text prompts [60] for text-conditional LDMs to generate 50K images and compare with the real validation images.
Hardware Specification Yes Training IN-1K LDMs for 178K iterations takes about 2.5 days on 8 NVIDIA A100.
Software Dependencies No The paper mentions software such as a "pre-trained BERT [59]" (bert-base-uncased) and "Diffusers [155]" but does not provide specific version numbers for these components, which are required for a reproducible description.
Experiment Setup Yes We use a class embedding layer and a learnable pre-trained BERT [59] to compute the conditional embeddings of the IN-1K class labels and the CC3M text prompts. To introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M, following [48, 49] (other corruption types studied in Section 3.3). We train models with different corruption ratios η ∈ {0, 2.5, 5, 7.5, 10, 15, 20}%. More details on synthetic corruption and pre-training recipes are shown in Appendix B.1 and B.3. Table 3: Hyper-parameters of IN-1K class-conditional and CC3M text-conditional LDMs. (Includes Down-sampling Factor, Latent Shape, Vocabulary Size, Diffusion Steps, Noise Schedule, U-Net Param. Size, Condition Net, Channels, Channel Multiplier, Number of Heads, Batch Size, Training Iter., Learning Rate)
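The two corruption schemes quoted above (random label flipping for IN-1K, caption swapping for CC3M) can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the authors' released code; the function names and the exact rounding of the corruption ratio η are hypothetical.

```python
import random

def corrupt_labels(labels, num_classes, eta, seed=0):
    """Flip a fraction eta of class labels to a different random class
    (sketch of the IN-1K label-flipping corruption)."""
    rng = random.Random(seed)
    corrupted = list(labels)
    n_flip = int(round(eta * len(labels)))  # number of labels to corrupt
    for i in rng.sample(range(len(labels)), n_flip):
        # pick any class except the true one
        choices = [c for c in range(num_classes) if c != corrupted[i]]
        corrupted[i] = rng.choice(choices)
    return corrupted

def corrupt_captions(captions, eta, seed=0):
    """Swap captions between randomly sampled image-text pairs
    (sketch of the CC3M caption-swapping corruption)."""
    rng = random.Random(seed)
    corrupted = list(captions)
    n_pairs = int(round(eta * len(captions) / 2))  # each swap touches 2 samples
    idx = rng.sample(range(len(captions)), 2 * n_pairs)
    for a, b in zip(idx[::2], idx[1::2]):
        corrupted[a], corrupted[b] = corrupted[b], corrupted[a]
    return corrupted
```

With η = 0 both functions return the conditions unchanged, recovering the clean-data baseline; with η = 0.05 roughly 5% of samples carry a mismatched condition, matching the corruption ratios listed above.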