On the Role of Optimization in Double Descent: A Least Squares Study

Authors: Ilja Kuzborskij, Csaba Szepesvári, Omar Rivasplata, Amal Rannen-Triki, Razvan Pascanu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically it has been observed that the performance of deep neural networks steadily improves with increased model size, contradicting the classical view on overfitting and generalization. ... We empirically explore if our predictions hold for neural networks, in particular whether the spectrum of the sample covariance of features at intermediary hidden layers has a similar behaviour to the one predicted by our derivations in the least squares setting. ... Fig. 1 provides a summary of our findings. ... Fig. 2 provides the main findings on this experiment. Similar to Fig. 1, we depict 3 columns showing snapshots at different numbers of gradient updates: 1000, 10000 and 100000. The first row shows test error (number of misclassified examples out of the test examples) computed on the full test set of 10000 data points, which, as expected, shows the double descent curve with a peak around 1000 hidden units. (A sketch of the feature-covariance spectrum computation follows the table.)
Researcher Affiliation | Collaboration | Ilja Kuzborskij (DeepMind), Csaba Szepesvári (University of Alberta and DeepMind, Edmonton, Canada), Omar Rivasplata (University College London), Amal Rannen-Triki (DeepMind), Razvan Pascanu (DeepMind)
Pseudocode | No | The paper describes the gradient descent update rule in text but does not provide structured pseudocode or an algorithm block. (A hedged sketch of such an update for the least squares setting follows the table.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | We focus on one hidden layer MLPs on the MNIST and Fashion MNIST datasets. ... and a training set of 1000 randomly chosen examples for both datasets.
Dataset Splits | No | The paper explicitly mentions using a training set and a test set, but it does not describe a validation set or give split percentages for all three.
Hardware Specification | No | The paper does not mention specific hardware components (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | GD is run with learning rate α = 0.05 and initialization drawn from N(0, 1/d I). ... We follow the protocol used by Belkin et al. [2019], relying on a squared error loss. In order to increase the model size we simply increase the dimensionality of the latent space, and rely on gradient descent with a fixed learning rate and a training set of 1000 randomly chosen examples for both datasets. More details can be found in Appendix G. (An illustrative sketch of this setup also follows the table.)
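
For the experiment described in the Research Type row, a minimal sketch of how the eigenvalue spectrum of the sample covariance of hidden-layer features could be computed is given below. It operates on a generic array of activations; the function name, the centering step, and the 1/n scaling are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hidden_feature_spectrum(phi):
    """Eigenvalue spectrum of the sample covariance of hidden-layer features.

    `phi` is an (n_examples, n_hidden) array of activations from an
    intermediary hidden layer; the centering and 1/n scaling are one common
    convention, not necessarily the paper's exact choice.
    """
    phi = phi - phi.mean(axis=0, keepdims=True)   # center the features
    cov = phi.T @ phi / phi.shape[0]              # sample covariance (n_hidden x n_hidden)
    return np.linalg.eigvalsh(cov)[::-1]          # eigenvalues, largest first
```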
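
The Pseudocode row notes that the gradient descent update rule is only described in text. A minimal sketch of full-batch gradient descent on the squared error in the least squares setting, using the fixed learning rate and N(0, 1/d I) initialization quoted in the Experiment Setup row, might look as follows; the function name and the 1/n loss scaling are assumptions made for illustration.

```python
import numpy as np

def least_squares_gd(X, y, n_steps=100_000, lr=0.05, seed=0):
    """Gradient descent on the squared error 0.5/n * ||X w - y||^2.

    Initialization w_0 ~ N(0, (1/d) I) and a fixed learning rate mirror the
    setup quoted above; the names here are illustrative, not the paper's code.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=np.sqrt(1.0 / d), size=d)  # w_0 ~ N(0, (1/d) I)
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n                # gradient of the squared loss
        w = w - lr * grad                           # fixed-step GD update
    return w
```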
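
Finally, a hedged sketch of the neural network setup quoted in the Experiment Setup row: a one-hidden-layer MLP trained on 1000 random MNIST examples with a squared error loss and full-batch gradient descent at learning rate 0.05. The PyTorch/torchvision dependencies, the ReLU nonlinearity, and applying the N(0, 1/d I) initialization to the readout layer are assumptions; Appendix G of the paper is the authoritative description.

```python
import torch
from torch import nn
from torchvision import datasets, transforms

def train_mlp(n_hidden, n_train=1000, n_steps=1000, lr=0.05, seed=0):
    """One-hidden-layer MLP trained with full-batch GD on a squared error loss.

    Mirrors the quoted setup (1000 random MNIST examples, fixed learning rate
    0.05, squared loss); the ReLU nonlinearity and the placement of the
    N(0, 1/d I) initialization are illustrative assumptions.
    """
    torch.manual_seed(seed)
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    idx = torch.randperm(len(mnist))[:n_train].tolist()  # 1000 random training examples
    X = torch.stack([mnist[i][0].flatten() for i in idx])
    y = nn.functional.one_hot(torch.tensor([mnist[i][1] for i in idx]), 10).float()

    model = nn.Sequential(nn.Linear(28 * 28, n_hidden), nn.ReLU(),
                          nn.Linear(n_hidden, 10))
    nn.init.normal_(model[2].weight, std=(1.0 / n_hidden) ** 0.5)  # ~ N(0, (1/d) I) on the readout
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # full-batch GD = SGD over the whole set
    loss_fn = nn.MSELoss()                            # squared error loss

    for _ in range(n_steps):  # snapshots would be taken at 1000, 10000, 100000 updates
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model
```

Varying `n_hidden` across runs is what traces out the double descent curve reported in the Research Type row, with the peak near 1000 hidden units.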