Extreme Memorization via Scale of Initialization

Authors: Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur

ICLR 2021

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
"We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with sin activation demonstrating extreme memorization. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization correlates with misalignment of representations and gradients across examples in the same class. This insight allows us to devise an alignment measure over gradients and representations which can capture this phenomenon. We demonstrate that our alignment measure correlates with generalization of deep models trained on image classification tasks."
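The abstract mentions an alignment measure over gradients and representations but does not define it here. One plausible reading, averaging pairwise cosine similarity of per-example vectors (representations or gradients) within each class, can be sketched as follows; the function name and normalization are assumptions, not the paper's exact definition:

```python
import numpy as np

def class_alignment(vectors, labels):
    """Hypothetical sketch of a class-wise alignment measure: the mean
    pairwise cosine similarity of vectors belonging to the same class,
    averaged over classes. High alignment suggests examples of a class
    share similar representations/gradients; misalignment (values near
    zero or negative) is what the paper associates with memorization."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Normalize each vector to unit length (guard against zero vectors).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    per_class = []
    for c in np.unique(labels):
        u = unit[labels == c]
        n = len(u)
        if n < 2:
            continue
        gram = u @ u.T
        # Mean of the off-diagonal cosine similarities within the class.
        per_class.append((gram.sum() - n) / (n * (n - 1)))
    return float(np.mean(per_class))
```

Perfectly aligned within-class vectors give 1.0; diametrically opposed ones give -1.0.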
Researcher Affiliation: Collaboration
Harsh Mehta, Google Research (harshm@google.com); Ashok Cutkosky, Boston University (ashok@cutkosky.com); Behnam Neyshabur, Blueshift, Alphabet (neyshabur@google.com)
Pseudocode: No
The paper describes experimental procedures and derivations but does not include any formal pseudocode or algorithm blocks.
Open Source Code: Yes
"The code used for experiments is open-sourced at https://github.com/google-research/google-research/tree/master/extreme_memorization"
Open Datasets: Yes
"We observe this phenomenon on 3 different image classification datasets: CIFAR-10, CIFAR-100 and SVHN. ... CIFAR-10 dataset (Krizhevsky, 2009) ... CIFAR-100 (Krizhevsky et al.) ... The Street View House Numbers (SVHN) Dataset (Netzer et al., 2011)"
Dataset Splits: No
The paper specifies training and test set sizes for CIFAR-10 and SVHN but does not mention a separate validation set or describe how data was split for validation.
Hardware Specification: Yes
"We employ a p100 single-instance GPU for each training run."
Software Dependencies: No
The paper mentions using the "Tensorflow framework" but does not specify a version number for TensorFlow or any other software component, which is required for reproducibility.
Experiment Setup: Yes
"In every experiment, we train using SGD, without momentum, with a constant learning rate of 0.01 and batch size of 256. ... In our 2-layer MLP model, in almost all cases we use 1024 units for the hidden layer... More details on the exact setup, datasets used and hyper-parameters are in the appendix."
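The stated setup (2-layer MLP with 1024 hidden units, plain SGD, learning rate 0.01, batch size 256) combined with the paper's scale-of-initialization knob can be sketched in NumPy. The He-style baseline initialization, the `alpha` scaling of the first layer only, and the softmax cross-entropy loss are assumptions for illustration, not details confirmed by this page:

```python
import numpy as np

def init_mlp(d_in, d_hidden, d_out, alpha=1.0, seed=0):
    """2-layer MLP parameters; `alpha` scales the first layer's (assumed
    He-style Gaussian) initialization, mimicking the scale knob the paper
    varies. alpha=1 is the baseline; large alpha probes memorization."""
    rng = np.random.default_rng(seed)
    W1 = alpha * rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in, d_hidden))
    W2 = rng.normal(0.0, np.sqrt(2.0 / d_hidden), (d_hidden, d_out))
    return W1, W2

def sgd_step(W1, W2, x, y, lr=0.01):
    """One plain SGD step (no momentum, constant lr, as in the paper)
    on softmax cross-entropy with ReLU hidden units."""
    h_pre = x @ W1
    h = np.maximum(h_pre, 0.0)                      # ReLU activation
    logits = h @ W2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # softmax
    n = x.shape[0]
    d_logits = (p - np.eye(W2.shape[1])[y]) / n     # dL/dlogits
    dW2 = h.T @ d_logits
    dh = d_logits @ W2.T
    dh[h_pre <= 0] = 0.0                            # ReLU gradient mask
    dW1 = x.T @ dh
    return W1 - lr * dW1, W2 - lr * dW2
```

With random labels this toy loop still drives training loss down, which is the memorization regime the paper probes at large initialization scale.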