Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we empirically demonstrate the transition from generalization to memorization that occurs over recursive iterations, and investigate the underlying factors driving this transition. All experiments in this section are conducted on the CIFAR-10 dataset [23] using a UNet-based DDPM model [16], under the replace paradigm, where each model is trained solely on samples generated by the model from the previous iteration. However, in Appendix D, we extend the explorations to other datasets and paradigms, where the conclusion remains valid.
Researcher Affiliation	Academia	Lianghe Shi (University of Michigan, United States); Meng Wu (University of Michigan, United States); Huijie Zhang (University of Michigan, United States); Zekai Zhang (University of Michigan, United States); Molei Tao (Georgia Institute of Technology, United States); Qing Qu (University of Michigan, United States)
Pseudocode	Yes	Appendix B.3 Pseudo-codes for the Algorithms We present the pseudo-code of the Greedy Selection and Threshold Decay Filter methods in Algorithms 1 and 2.
Open Source Code	Yes	The source code is available at https://github.com/shilianghe007/Model_Collapse.git
Open Datasets	Yes	We conduct experiments on three widely used image generation benchmarks. CIFAR-10 [23] consists of 32×32 color images in 10 classes. Due to computational constraints, we use a subset of 32,768 training images. Our goal is not to achieve state-of-the-art FID among large diffusion models but rather to demonstrate that our method mitigates memorization in the self-consuming loop. As shown in Section 3, this subset is sufficient to observe the transition from generalization to memorization. We also conduct experiments on subsets of FFHQ [31], downsampled to 32×32 resolution, and MNIST [32], using 8,192 and 12,000 training images, respectively.
Dataset Splits	No	The paper states: "We conduct experiments on three widely used image generation benchmarks. CIFAR-10 [23]... We use a subset of 32,768 training images... We also conduct experiments on subsets of FFHQ [31]... and MNIST [32], using 8,192 and 12,000 training images, respectively." This describes the size of initial training sets, but does not provide specific train/test/validation splits for reproduction of evaluation results in percentages or counts.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA A-100 GPU.
Software Dependencies	No	Our implementation is based on the Hugging Face Diffusers codebase [34] of DDPM. We use a mixed-precision training of FP16 to train the models. We adopt an Adam optimizer with a learning rate of 10^-4 and a weight decay of 10^-6. The batch size is 128. A 1000-step denoising process is used, with all other hyperparameters set to their default values. For Threshold Decay Filter, we use an initial threshold of 60 and a decay rate of 0.95. We show in the Appendix that our method is robust in a wide range of hyperparameters. The paper mentions the Hugging Face Diffusers codebase [34] but does not provide a specific version number for this or any other software dependency, such as the machine learning framework used (e.g., PyTorch, TensorFlow).
Experiment Setup	Yes	Implementation. For iterative training and sampling, our implementation is based on the Hugging Face Diffusers codebase [34] of DDPM. We use a mixed-precision training of FP16 to train the models. We adopt an Adam optimizer with a learning rate of 10^-4 and a weight decay of 10^-6. The batch size is 128. A 1000-step denoising process is used, with all other hyperparameters set to their default values. For Threshold Decay Filter, we use an initial threshold of 60 and a decay rate of 0.95. We show in the Appendix that our method is robust in a wide range of hyperparameters.
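
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may be a convenient starting point when attempting a reproduction. This is a minimal sketch, not the authors' code: the names `TrainingConfig` and `threshold_schedule` are hypothetical, the learning rate and weight decay are read as 10^-4 and 10^-6, and the geometric form of the threshold decay is an assumption, since the quoted text gives only the initial threshold and the decay rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are hypothetical)."""

    learning_rate: float = 1e-4        # Adam learning rate
    weight_decay: float = 1e-6         # Adam weight decay
    batch_size: int = 128
    num_denoising_steps: int = 1000    # DDPM denoising process length
    mixed_precision: str = "fp16"      # FP16 mixed-precision training
    initial_threshold: float = 60.0    # Threshold Decay Filter: initial threshold
    decay_rate: float = 0.95           # Threshold Decay Filter: decay rate


def threshold_schedule(cfg: TrainingConfig, num_steps: int) -> List[float]:
    """Return the threshold at each step.

    ASSUMPTION: the threshold decays geometrically,
    threshold_k = initial_threshold * decay_rate ** k.
    The source does not state when or how the decay is applied.
    """
    return [cfg.initial_threshold * cfg.decay_rate ** k for k in range(num_steps)]


cfg = TrainingConfig()
thresholds = threshold_schedule(cfg, 5)
```

Under this assumed schedule the threshold starts at 60 and shrinks by 5% per step; the paper's Algorithm 2 (Appendix B.3) should be consulted for the actual update rule.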