Model Collapse Demystified: The Case of Regression
Authors: Elvis Dohmatob, Yunzhen Feng, Julia Kempe
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical results are validated with experiments. |
| Researcher Affiliation | Collaboration | FAIR, Meta; Center for Data Science, New York University; Courant Institute of Mathematical Sciences, New York University |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | We only use one publicly available dataset, MNIST, and no idiosyncratic model. Thus, we provide neither dataset nor code, as the dataset is publicly available, and the experiments are easy to reproduce from their description. |
| Open Datasets | Yes | We conduct experiments using kernel ridge regression on the MNIST dataset [16] |
| Dataset Splits | No | The classification dataset contains 60,000 training and 10,000 test data points (handwritten digits), with labels from 0 to 9 inclusive. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory) were mentioned for the experimental setup. The acknowledgments only vaguely refer to 'NYU IT High Performance Computing (HPC) resources, services, and staff expertise'. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions) were mentioned in the paper. |
| Experiment Setup | Yes | Specifically, the models were trained using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.1. We employed a regression setting where labels were converted to one-hot vectors, and the model was trained using mean squared error for 200 epochs to convergence. When generating the synthetic data, Gaussian label noise with a standard deviation of 0.1 is added (a code sketch of this setup follows the table). |
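
Since the authors release no code, the following is only a minimal sketch of the synthetic-data retraining loop implied by the Experiment Setup row: SGD with batch size 128 and learning rate 0.1, one-hot MSE regression for 200 epochs, and Gaussian label noise with standard deviation 0.1 when regenerating labels. The plain linear model, the `load_mnist`/`train_regressor` helpers, and the number of generations are placeholder assumptions, not the paper's exact (kernel ridge) setup.

```python
# Hypothetical sketch of the model-collapse regression experiment on MNIST.
# Hyperparameters (batch size 128, lr 0.1, 200 epochs, noise std 0.1) come from
# the paper's description; the linear model and 3 generations are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets


def load_mnist(train=True):
    ds = datasets.MNIST("./data", train=train, download=True)
    X = ds.data.reshape(len(ds), -1).float() / 255.0         # flatten 28x28 images to 784-dim vectors
    Y = nn.functional.one_hot(ds.targets, 10).float()        # one-hot labels as regression targets
    return X, Y


def train_regressor(X, Y, epochs=200, batch_size=128, lr=0.1):
    model = nn.Linear(X.shape[1], Y.shape[1])                 # placeholder linear regressor
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            nn.functional.mse_loss(model(xb), yb).backward()
            opt.step()
    return model


X_train, Y_train = load_mnist(train=True)
X_test, Y_test = load_mnist(train=False)

labels = Y_train
for generation in range(3):                                    # number of generations is an assumption
    model = train_regressor(X_train, labels)
    with torch.no_grad():
        test_mse = nn.functional.mse_loss(model(X_test), Y_test).item()
    print(f"generation {generation}: test MSE = {test_mse:.4f}")
    # Regenerate training labels from the current model, adding Gaussian noise
    # with standard deviation 0.1 as described for the synthetic-data step.
    with torch.no_grad():
        labels = model(X_train) + 0.1 * torch.randn_like(Y_train)
```

A swap of the linear layer for a kernel ridge regressor (e.g. scikit-learn's `KernelRidge`) would bring the sketch closer to the paper's stated method; the loop structure over generations stays the same.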