Extreme Memorization via Scale of Initialization

Authors: Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur

ICLR 2021

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
"We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with sin activation demonstrating extreme memorization. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization correlates with misalignment of representations and gradients across examples in the same class. This insight allows us to devise an alignment measure over gradients and representations which can capture this phenomenon. We demonstrate that our alignment measure correlates with generalization of deep models trained on image classification tasks."
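The abstract mentions an alignment measure over gradients and representations but does not define it here. One plausible reading, averaging pairwise cosine similarity of per-example vectors (representations or gradients) within each class, can be sketched as follows; the function name and normalization are assumptions, not the paper's exact definition:

```python
import numpy as np

def class_alignment(vectors, labels):
    """Hypothetical sketch of a class-wise alignment measure: the mean
    pairwise cosine similarity of vectors belonging to the same class,
    averaged over classes. High alignment suggests examples of a class
    share similar representations/gradients; misalignment (values near
    zero or negative) is what the paper associates with memorization."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Normalize each vector to unit length (guard against zero vectors).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    per_class = []
    for c in np.unique(labels):
        u = unit[labels == c]
        n = len(u)
        if n < 2:
            continue
        gram = u @ u.T
        # Mean of the off-diagonal cosine similarities within the class.
        per_class.append((gram.sum() - n) / (n * (n - 1)))
    return float(np.mean(per_class))
```

Perfectly aligned within-class vectors give 1.0; diametrically opposed ones give -1.0.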
Researcher Affiliation: Collaboration
Harsh Mehta, Google Research (harshm@google.com); Ashok Cutkosky, Boston University (ashok@cutkosky.com); Behnam Neyshabur, Blueshift, Alphabet (neyshabur@google.com)
Pseudocode: No
The paper describes experimental procedures and derivations but does not include any formal pseudocode or algorithm blocks.
Open Source Code: Yes
"The code used for experiments is open-sourced at https://github.com/google-research/google-research/tree/master/extreme_memorization"
Open Datasets: Yes
"We observe this phenomenon on 3 different image classification datasets: CIFAR-10, CIFAR-100 and SVHN. ... CIFAR-10 dataset (Krizhevsky, 2009) ... CIFAR-100 (Krizhevsky et al.) ... The Street View House Numbers (SVHN) Dataset (Netzer et al., 2011)"
Dataset Splits: No
The paper specifies training and test set sizes for CIFAR-10 and SVHN but does not mention a separate validation set or describe how data was split for validation.
Hardware Specification: Yes
"We employ a p100 single-instance GPU for each training run."
Software Dependencies: No
The paper mentions using the "Tensorflow framework" but does not specify a version number for TensorFlow or any other software component, which is required for reproducibility.
Experiment Setup: Yes
"In every experiment, we train using SGD, without momentum, with a constant learning rate of 0.01 and batch size of 256. ... In our 2-layer MLP model, in almost all cases we use 1024 units for the hidden layer... More details on the exact setup, datasets used and hyper-parameters are in the appendix."
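The stated setup (2-layer MLP with 1024 hidden units, plain SGD, learning rate 0.01, batch size 256) combined with the paper's scale-of-initialization knob can be sketched in NumPy. The He-style baseline initialization, the `alpha` scaling of the first layer only, and the softmax cross-entropy loss are assumptions for illustration, not details confirmed by this page:

```python
import numpy as np

def init_mlp(d_in, d_hidden, d_out, alpha=1.0, seed=0):
    """2-layer MLP parameters; `alpha` scales the first layer's (assumed
    He-style Gaussian) initialization, mimicking the scale knob the paper
    varies. alpha=1 is the baseline; large alpha probes memorization."""
    rng = np.random.default_rng(seed)
    W1 = alpha * rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in, d_hidden))
    W2 = rng.normal(0.0, np.sqrt(2.0 / d_hidden), (d_hidden, d_out))
    return W1, W2

def sgd_step(W1, W2, x, y, lr=0.01):
    """One plain SGD step (no momentum, constant lr, as in the paper)
    on softmax cross-entropy with ReLU hidden units."""
    h_pre = x @ W1
    h = np.maximum(h_pre, 0.0)                      # ReLU activation
    logits = h @ W2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # softmax
    n = x.shape[0]
    d_logits = (p - np.eye(W2.shape[1])[y]) / n     # dL/dlogits
    dW2 = h.T @ d_logits
    dh = d_logits @ W2.T
    dh[h_pre <= 0] = 0.0                            # ReLU gradient mask
    dW1 = x.T @ dh
    return W1 - lr * dW1, W2 - lr * dW2
```

With random labels this toy loop still drives training loss down, which is the memorization regime the paper probes at large initialization scale.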