Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Slight Corruption in Pre-training Data Makes Better Diffusion Models

Authors: Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents the first comprehensive study on the impact of such condition corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages.
Researcher Affiliation | Academia | (1) Carnegie Mellon University, (2) The University of Hong Kong, (3) Mila Quebec AI Institute, (4) RIKEN AIP, (5) The University of Tokyo, (6) William & Mary, (7) MBZUAI
Pseudocode | No | The paper does not include a clearly labeled "Pseudocode" or "Algorithm" block. Equations and descriptions of methods are provided, but not in pseudocode format.
Open Source Code | Yes | All models are released at https://huggingface.co/DiffusionNoise.
Open Datasets | Yes | We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. More specifically, we train class-conditional and text-conditional LDM-4 from scratch on synthetically corrupted IN-1K [50] and CC3M [14].
Dataset Splits | Yes | We use IN-1K class labels for class-conditional LDMs and MS-COCO text prompts [60] for text-conditional LDMs to generate 50K images and compare with the real validation images.
Hardware Specification | Yes | Training IN-1K LDMs for 178K iterations takes about 2.5 days on 8 NVIDIA A100.
Software Dependencies | No | The paper mentions software like "pre-trained BERT [59]" (bert-base-uncased) and "Diffusers [155]" but does not provide specific version numbers for these software components, which are required for a reproducible description.
Experiment Setup | Yes | We use a class embedding layer and a learnable pre-trained BERT [59] to compute the conditional embeddings of the IN-1K class labels and the CC3M text prompts. To introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M, following [48, 49] (other corruption types studied in Section 3.3). We train models with different corruption ratios η ∈ {0, 2.5, 5, 7.5, 10, 15, 20}%. More details on synthetic corruption and pre-training recipes are shown in Appendix B.1 and B.3. Table 3: Hyper-parameters of IN-1K class-conditional and CC3M text-conditional LDMs. (Includes Down-sampling Factor, Latent Shape, Vocabulary Size, Diffusion Steps, Noise Schedule, U-Net Param. Size, Condition Net, Channels, Channel Multiplier, Number of Heads, Batch Size, Training Iter., Learning Rate)
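
The condition corruption quoted in the Experiment Setup entry (flipping a class label to a random class, and swapping the text of two sampled image-text pairs, at a corruption ratio η) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `flip_labels` and `swap_captions`, the `seed` argument, the choice to always draw a different class when flipping, and the choice to count both captions of a swapped pair toward η are all assumptions made for illustration. Here η is passed as a fraction (0.05 for 5%).

```python
import random


def flip_labels(labels, num_classes, eta, seed=0):
    """Return a copy of `labels` with a fraction `eta` flipped to another class.

    Hypothetical helper: assumes num_classes >= 2 and that a flipped label
    must differ from the original one.
    """
    rng = random.Random(seed)
    corrupted = list(labels)
    n_corrupt = round(eta * len(corrupted))
    # Pick distinct positions to corrupt.
    for i in rng.sample(range(len(corrupted)), n_corrupt):
        # Draw uniformly from the other num_classes - 1 classes.
        new = rng.randrange(num_classes - 1)
        corrupted[i] = new if new < corrupted[i] else new + 1
    return corrupted


def swap_captions(captions, eta, seed=0):
    """Return a copy of `captions` where about a fraction `eta` are mismatched.

    Hypothetical helper: each swap mismatches two image-text pairs, so
    round(eta * N / 2) swaps are performed on disjoint index pairs.
    """
    rng = random.Random(seed)
    corrupted = list(captions)
    n_swaps = round(eta * len(corrupted) / 2)
    idx = rng.sample(range(len(corrupted)), 2 * n_swaps)
    # Pair up the sampled indices and exchange their captions.
    for a, b in zip(idx[0::2], idx[1::2]):
        corrupted[a], corrupted[b] = corrupted[b], corrupted[a]
    return corrupted


if __name__ == "__main__":
    labels = [i % 10 for i in range(1000)]
    noisy = flip_labels(labels, num_classes=10, eta=0.05)
    print(sum(a != b for a, b in zip(labels, noisy)), "labels flipped")
```

With η = 0 both helpers return an unchanged copy, which corresponds to the clean baseline in the list of corruption ratios.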