Extreme Memorization via Scale of Initialization
Authors: Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with sin activation demonstrating extreme memorization. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization correlates with misalignment of representations and gradients across examples in the same class. This insight allows us to devise an alignment measure over gradients and representations which can capture this phenomenon. We demonstrate that our alignment measure correlates with generalization of deep models trained on image classification tasks. (A hedged sketch of such an alignment measure appears below the table.) |
| Researcher Affiliation | Collaboration | Harsh Mehta (Google Research) harshm@google.com; Ashok Cutkosky (Boston University) ashok@cutkosky.com; Behnam Neyshabur (Blueshift, Alphabet) neyshabur@google.com |
| Pseudocode | No | The paper describes experimental procedures and derivations but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used for experiments is open-sourced at https://github.com/google-research/google-research/tree/master/extreme_memorization |
| Open Datasets | Yes | We observe this phenomenon on 3 different image classification datasets: CIFAR-10, CIFAR-100 and SVHN. ... CIFAR-10 dataset (Krizhevsky, 2009) ... CIFAR-100 (Krizhevsky et al.) ... The Street View House Numbers (SVHN) Dataset (Netzer et al., 2011) |
| Dataset Splits | No | The paper specifies training and test set sizes for CIFAR-10 and SVHN but does not mention the use of a separate validation set or details on how data was split for validation. |
| Hardware Specification | Yes | We employ a p100 single-instance GPU for each training run. |
| Software Dependencies | No | The paper mentions using the 'Tensorflow framework' but does not specify a version number for TensorFlow or any other software component, which would be needed for exact reproducibility. |
| Experiment Setup | Yes | In every experiment, we train using SGD, without momentum, with a constant learning rate of 0.01 and batch size of 256. ... In our 2-layer MLP model, in almost all cases we use 1024 units for the hidden layer... More details on the exact setup, datasets used and hyper-parameters are in the appendix. (A hedged sketch of this training setup appears below the table.) |
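
For orientation, here is a minimal sketch of the training setup described in the Experiment Setup row: a 2-layer MLP with 1024 hidden units trained with plain SGD (learning rate 0.01, no momentum, batch size 256) on CIFAR-10, with the first layer's standard Glorot initialization multiplied by a scale factor `alpha` to mimic the paper's scale-of-initialization manipulation. The `ScaledGlorot` helper, the `alpha` value, the epoch count, and the preprocessing are illustrative assumptions; the authors' exact scaling convention and per-experiment settings are in their repository, not reproduced here.

```python
import tensorflow as tf


class ScaledGlorot(tf.keras.initializers.Initializer):
    """Glorot-normal initialization multiplied by a scale factor.

    The class name and the exact scaling convention are assumptions for
    illustration; the paper varies "the scale of initialization" but may
    apply the factor differently.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.base = tf.keras.initializers.GlorotNormal()

    def __call__(self, shape, dtype=None, **kwargs):
        return self.alpha * self.base(shape, dtype=dtype)


def build_mlp(alpha=1.0, hidden_units=1024, num_classes=10):
    # 2-layer MLP over flattened 32x32x3 inputs (CIFAR-10 / SVHN style).
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hidden_units, activation="relu",
                              kernel_initializer=ScaledGlorot(alpha)),
        tf.keras.layers.Dense(num_classes),
    ])


# Plain SGD without momentum, constant learning rate 0.01, batch size 256,
# matching the hyper-parameters quoted in the table.
model = build_mlp(alpha=10.0)  # larger alpha -> larger initialization scale
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model.fit(x_train, y_train, batch_size=256, epochs=10,
          validation_data=(x_test, y_test))
```

Sweeping `alpha` over several orders of magnitude and comparing train versus test accuracy is the kind of interpolation from good generalization to memorization that the abstract describes.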
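
The abstract also mentions an alignment measure over gradients and representations. As a rough, hedged reading of that idea, the sketch below computes the average pairwise cosine similarity of per-example vectors within each class; `class_alignment` is a hypothetical helper name, and the paper's exact normalization and weighting may differ.

```python
import numpy as np


def class_alignment(vectors, labels, num_classes):
    """Average pairwise cosine similarity of per-example vectors within each
    class (e.g., hidden representations or flattened per-example gradients).

    A minimal sketch of an alignment-style measure; not the paper's exact
    definition.
    """
    # Normalize each example's vector to unit length.
    normed = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    per_class = []
    for c in range(num_classes):
        v = normed[labels == c]
        n = len(v)
        if n < 2:
            continue
        sims = v @ v.T  # pairwise cosine similarities within class c
        # Exclude the diagonal (self-similarity) from the average.
        per_class.append((sims.sum() - n) / (n * (n - 1)))
    return float(np.mean(per_class))
```

In the paper's setting one would pass, for example, the hidden-layer activations of a batch (or its flattened per-example gradients) together with the class labels; lower values indicate the within-class misalignment that the abstract associates with larger initialization scales and poorer generalization.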