On the Role of Optimization in Double Descent: A Least Squares Study

Authors: Ilja Kuzborskij, Csaba Szepesvári, Omar Rivasplata, Amal Rannen-Triki, Razvan Pascanu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically it has been observed that the performance of deep neural networks steadily improves with increased model size, contradicting the classical view on overfitting and generalization. ... We empirically explore if our predictions hold for neural networks, in particular whether the spectrum of the sample covariance of features at intermediary hidden layers has a similar behaviour to the one predicted by our derivations in the least squares setting. ... Fig. 1 provides a summary of our findings. ... Fig. 2 provides the main findings on this experiment. Similar to Fig. 1, we depict 3 columns showing snapshots at different numbers of gradient updates: 1000, 10000 and 100000. The first row shows test error (number of misclassified examples out of the test examples) computed on the full test set of 10000 data points, which, as expected, shows the double descent curve with a peak around 1000 hidden units. (A sketch of the feature-covariance spectrum computation follows the table.)
Researcher Affiliation | Collaboration | Ilja Kuzborskij (DeepMind), Csaba Szepesvári (University of Alberta and DeepMind, Edmonton, Canada), Omar Rivasplata (University College London), Amal Rannen-Triki (DeepMind), Razvan Pascanu (DeepMind)
Pseudocode | No | The paper describes the gradient descent update rule in text but does not provide structured pseudocode or an algorithm block. (A hedged sketch of such an update for the least squares setting follows the table.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | We focus on one hidden layer MLPs on the MNIST and Fashion MNIST datasets. ... and a training set of 1000 randomly chosen examples for both datasets.
Dataset Splits | No | The paper explicitly mentions using a training set and a test set, but it does not describe a validation set or give split percentages for all three.
Hardware Specification | No | The paper does not mention specific hardware components (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | GD is run with learning rate α = 0.05 and initialization drawn from N(0, 1/d I). ... We follow the protocol used by Belkin et al. [2019], relying on a squared error loss. In order to increase the model size we simply increase the dimensionality of the latent space, and rely on gradient descent with a fixed learning rate and a training set of 1000 randomly chosen examples for both datasets. More details can be found in Appendix G. (An illustrative sketch of this setup also follows the table.)
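
For the experiment described in the Research Type row, a minimal sketch of how the eigenvalue spectrum of the sample covariance of hidden-layer features could be computed is given below. It operates on a generic array of activations; the function name, the centering step, and the 1/n scaling are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hidden_feature_spectrum(phi):
    """Eigenvalue spectrum of the sample covariance of hidden-layer features.

    `phi` is an (n_examples, n_hidden) array of activations from an
    intermediary hidden layer; the centering and 1/n scaling are one common
    convention, not necessarily the paper's exact choice.
    """
    phi = phi - phi.mean(axis=0, keepdims=True)   # center the features
    cov = phi.T @ phi / phi.shape[0]              # sample covariance (n_hidden x n_hidden)
    return np.linalg.eigvalsh(cov)[::-1]          # eigenvalues, largest first
```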
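
The Pseudocode row notes that the gradient descent update rule is only described in text. A minimal sketch of full-batch gradient descent on the squared error in the least squares setting, using the fixed learning rate and N(0, 1/d I) initialization quoted in the Experiment Setup row, might look as follows; the function name and the 1/n loss scaling are assumptions made for illustration.

```python
import numpy as np

def least_squares_gd(X, y, n_steps=100_000, lr=0.05, seed=0):
    """Gradient descent on the squared error 0.5/n * ||X w - y||^2.

    Initialization w_0 ~ N(0, (1/d) I) and a fixed learning rate mirror the
    setup quoted above; the names here are illustrative, not the paper's code.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=np.sqrt(1.0 / d), size=d)  # w_0 ~ N(0, (1/d) I)
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n                # gradient of the squared loss
        w = w - lr * grad                           # fixed-step GD update
    return w
```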
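
Finally, a hedged sketch of the neural network setup quoted in the Experiment Setup row: a one-hidden-layer MLP trained on 1000 random MNIST examples with a squared error loss and full-batch gradient descent at learning rate 0.05. The PyTorch/torchvision dependencies, the ReLU nonlinearity, and applying the N(0, 1/d I) initialization to the readout layer are assumptions; Appendix G of the paper is the authoritative description.

```python
import torch
from torch import nn
from torchvision import datasets, transforms

def train_mlp(n_hidden, n_train=1000, n_steps=1000, lr=0.05, seed=0):
    """One-hidden-layer MLP trained with full-batch GD on a squared error loss.

    Mirrors the quoted setup (1000 random MNIST examples, fixed learning rate
    0.05, squared loss); the ReLU nonlinearity and the placement of the
    N(0, 1/d I) initialization are illustrative assumptions.
    """
    torch.manual_seed(seed)
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    idx = torch.randperm(len(mnist))[:n_train].tolist()  # 1000 random training examples
    X = torch.stack([mnist[i][0].flatten() for i in idx])
    y = nn.functional.one_hot(torch.tensor([mnist[i][1] for i in idx]), 10).float()

    model = nn.Sequential(nn.Linear(28 * 28, n_hidden), nn.ReLU(),
                          nn.Linear(n_hidden, 10))
    nn.init.normal_(model[2].weight, std=(1.0 / n_hidden) ** 0.5)  # ~ N(0, (1/d) I) on the readout
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # full-batch GD = SGD over the whole set
    loss_fn = nn.MSELoss()                            # squared error loss

    for _ in range(n_steps):  # snapshots would be taken at 1000, 10000, 100000 updates
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model
```

Varying `n_hidden` across runs is what traces out the double descent curve reported in the Research Type row, with the peak near 1000 hidden units.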