Identity Crisis: Memorization and Generalization Under Extreme Overparameterization
Authors: Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. |
| Researcher Affiliation | Collaboration | Chiyuan Zhang & Samy Bengio, Google Research, Brain Team, Mountain View, CA 94043, USA ({chiyuan,bengio}@google.com); Moritz Hardt, University of California, Berkeley, Berkeley, CA 94720, USA (hardt@berkeley.edu); Michael C. Mozer, Google Research, Brain Team, Mountain View, CA 94043, USA (mcmozer@google.com); Yoram Singer, Princeton University, Princeton, NJ 08544, USA (y.s@princeton.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the methodology described. |
| Open Datasets | Yes | The main study is done with the MNIST dataset, which consists of grayscale images of handwritten digits of size 28×28. For training, we randomly sample one digit from the training set (a digit "7") with a fixed random seed. For testing, we use random images from the test sets of MNIST and Fashion-MNIST, as well as algorithmically generated structured patterns and random images. (See the data-loading sketch after the table.) |
| Dataset Splits | No | The paper's experiments center on a single training example and separately chosen test examples. Although it mentions using the 60k MNIST training examples in some cases, it does not provide specific training/validation/test splits (e.g., percentages or counts) or reference predefined standard splits for reproduction; it only states that 'All nets perform well on the training set (first three columns) and transfer well to novel digits and digit blends (columns 4-6)' for the 60k-example setting. |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper does not specify software versions for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | The models are trained by minimizing the mean squared error (MSE) loss with vanilla SGD (base learning rate 0.01, momentum 0.9). The learning rate follows a stagewise-constant schedule that decays by a factor of 0.2 at 30%, 60%, and 80% of the total training steps (2,000,000). No weight decay is applied during training. The rectified linear unit (ReLU) activation is used for both fully connected networks (FCNs) and convolutional networks (CNNs). The hidden dimension of the FCNs is 2,048 by default. The CNNs use 5×5 kernels with stride 1 and padding 2, so the spatial dimensions are unchanged after each convolution layer. (See the training sketch after the table.) |
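
The single-example data setup reported above can be sketched as follows. The paper does not name a framework, so the use of torchvision, the seed value, and the specific probe images chosen here are illustrative assumptions; only the facts that one MNIST digit ("7") is sampled with a fixed seed and that test probes come from MNIST, Fashion-MNIST, and random/structured images are taken from the report.

```python
# Sketch of the single-example identity-mapping data setup. The framework
# (torchvision), the seed value, and the specific probe indices are assumptions.
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Training data: a single MNIST digit drawn with a fixed random seed.
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
torch.manual_seed(0)                               # hypothetical seed; the paper only says "fixed"
idx = torch.randint(len(train_set), (1,)).item()
x_train, _ = train_set[idx]                        # shape (1, 28, 28); the target is the input itself

# Test probes: held-out MNIST / Fashion-MNIST images plus a random-image probe.
mnist_test = datasets.MNIST(root="./data", train=False, download=True, transform=to_tensor)
fashion_test = datasets.FashionMNIST(root="./data", train=False, download=True, transform=to_tensor)
x_probe_digit, _ = mnist_test[0]
x_probe_fashion, _ = fashion_test[0]
x_probe_noise = torch.rand(1, 28, 28)              # random-image probe
```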
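
A minimal sketch of the reported training configuration follows. PyTorch, the two-layer FCN depth, and the stand-in input are assumptions made for illustration; the report only fixes the optimizer, loss, learning-rate schedule, hidden width, and step count.

```python
# Minimal sketch of the reported training configuration. PyTorch, the 2-layer
# FCN depth, and the stand-in input are assumptions; the optimizer, loss,
# schedule, hidden width, and step count follow the report above.
import torch
import torch.nn as nn

total_steps = 2_000_000
model = nn.Sequential(                      # FCN acting on flattened 28x28 images
    nn.Flatten(),
    nn.Linear(28 * 28, 2048), nn.ReLU(),    # default hidden dimension 2,048
    nn.Linear(2048, 28 * 28),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)
# Stagewise-constant schedule: multiply the learning rate by 0.2 at 30%, 60%, 80% of training.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt,
    milestones=[int(0.3 * total_steps), int(0.6 * total_steps), int(0.8 * total_steps)],
    gamma=0.2,
)
loss_fn = nn.MSELoss()

x = torch.rand(1, 1, 28, 28)                # stands in for the single sampled digit
for step in range(total_steps):
    opt.zero_grad()
    loss = loss_fn(model(x), x.flatten(1))  # identity task: reconstruct the input
    loss.backward()
    opt.step()
    sched.step()
```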