Identity Crisis: Memorization and Generalization Under Extreme Overparameterization
Authors: Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. |
| Researcher Affiliation | Collaboration | Chiyuan Zhang & Samy Bengio, Google Research, Brain Team, Mountain View, CA 94043, USA ({chiyuan,bengio}@google.com); Moritz Hardt, University of California, Berkeley, Berkeley, CA 94720, USA (hardt@berkeley.edu); Michael C. Mozer, Google Research, Brain Team, Mountain View, CA 94043, USA (mcmozer@google.com); Yoram Singer, Princeton University, Princeton, NJ 08544, USA (y.s@princeton.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the methodology described. |
| Open Datasets | Yes | The main study is done with the MNIST dataset, which consists of grayscale images of handwritten digits of size 28×28. For training, we randomly sample one digit from the training set (a digit "7") with a fixed random seed. For testing, we use random images from the test sets of MNIST and Fashion-MNIST, as well as algorithmically generated structured patterns and random images. (See the data-loading sketch after the table.) |
| Dataset Splits | No | The paper's experiments center on a single training example and separately chosen test examples. Although it mentions using the 60k MNIST training examples in some cases, it does not provide specific training/validation/test splits (e.g., percentages or counts) or reference predefined standard splits for reproduction; it only states that 'All nets perform well on the training set (first three columns) and transfer well to novel digits and digit blends (columns 4-6)' for the 60k-example setting. |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper does not specify software versions for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | The models are trained by minimizing the mean squared error (MSE) loss with vanilla SGD (base learning rate 0.01, momentum 0.9). The learning rate follows a stagewise-constant schedule that decays by a factor of 0.2 at 30%, 60%, and 80% of the total training steps (2,000,000). No weight decay is applied during training. The rectified linear unit (ReLU) activation is used for both fully connected networks (FCNs) and convolutional networks (CNNs). The hidden dimension of the FCNs is 2,048 by default. The CNNs use 5×5 kernels with stride 1 and padding 2, so the spatial dimensions are unchanged after each convolution layer. (See the training sketch after the table.) |
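
The single-example data setup reported above can be sketched as follows. The paper does not name a framework, so the use of torchvision, the seed value, and the specific probe images chosen here are illustrative assumptions; only the facts that one MNIST digit ("7") is sampled with a fixed seed and that test probes come from MNIST, Fashion-MNIST, and random/structured images are taken from the report.

```python
# Sketch of the single-example identity-mapping data setup. The framework
# (torchvision), the seed value, and the specific probe indices are assumptions.
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Training data: a single MNIST digit drawn with a fixed random seed.
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
torch.manual_seed(0)                               # hypothetical seed; the paper only says "fixed"
idx = torch.randint(len(train_set), (1,)).item()
x_train, _ = train_set[idx]                        # shape (1, 28, 28); the target is the input itself

# Test probes: held-out MNIST / Fashion-MNIST images plus a random-image probe.
mnist_test = datasets.MNIST(root="./data", train=False, download=True, transform=to_tensor)
fashion_test = datasets.FashionMNIST(root="./data", train=False, download=True, transform=to_tensor)
x_probe_digit, _ = mnist_test[0]
x_probe_fashion, _ = fashion_test[0]
x_probe_noise = torch.rand(1, 28, 28)              # random-image probe
```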
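
A minimal sketch of the reported training configuration follows. PyTorch, the two-layer FCN depth, and the stand-in input are assumptions made for illustration; the report only fixes the optimizer, loss, learning-rate schedule, hidden width, and step count.

```python
# Minimal sketch of the reported training configuration. PyTorch, the 2-layer
# FCN depth, and the stand-in input are assumptions; the optimizer, loss,
# schedule, hidden width, and step count follow the report above.
import torch
import torch.nn as nn

total_steps = 2_000_000
model = nn.Sequential(                      # FCN acting on flattened 28x28 images
    nn.Flatten(),
    nn.Linear(28 * 28, 2048), nn.ReLU(),    # default hidden dimension 2,048
    nn.Linear(2048, 28 * 28),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)
# Stagewise-constant schedule: multiply the learning rate by 0.2 at 30%, 60%, 80% of training.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt,
    milestones=[int(0.3 * total_steps), int(0.6 * total_steps), int(0.8 * total_steps)],
    gamma=0.2,
)
loss_fn = nn.MSELoss()

x = torch.rand(1, 1, 28, 28)                # stands in for the single sampled digit
for step in range(total_steps):
    opt.zero_grad()
    loss = loss_fn(model(x), x.flatten(1))  # identity task: reconstruct the input
    loss.backward()
    opt.step()
    sched.step()
```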