The Curious Case of Benign Memorization
Authors: Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction to all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that malign memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. |
| Researcher Affiliation | Academia | Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann. Department of Computer Science, ETH Zürich, Switzerland. {sotirios.anagnostidis,gregor.bachmann,lorenzo.noci}@inf.ethz.ch |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have also released the code as part of the supplementary material, including scripts on how to reproduce our results. |
| Open Datasets | Yes | We use the standard vision datasets CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), as well as Tiny ImageNet (Le & Yang, 2015). For more details, we refer the reader to Appendix F. Table 5 reports the dataset statistics: CIFAR-10 — 50,000 training examples; CIFAR-100 — 50,000 training examples; Tiny ImageNet — 100,000 training examples. (A hedged loading sketch follows the table below.) |
| Dataset Splits | No | The paper does not explicitly mention a separate validation set split percentage or size. Table 5 only specifies train and test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. It only mentions the use of ResNet and VGG architectures. |
| Software Dependencies | No | Table 6 lists hyperparameters, but no specific software versions (e.g., Python, PyTorch, TensorFlow versions) are mentioned. |
| Experiment Setup | Yes | Table 6: Hyperparameters for the random-labels experiments. Augmentations: random cropping with scale (0.08, 1), horizontal flip with probability 0.5, color jittering (0.8, 0.8, 0.8, 0.2), grayscale with probability 0.2, mixup; image size 32/64; no clip-norm; no dropout; projector size 65,536; no projector MLP normalization. Optimization: learning rate 4e-3 with no scheduler, Adam (β1, β2) = (0.9, 0.999), batch size 256, weight decay 0.0, MSE loss. (A hedged PyTorch sketch of this configuration follows the table below.) |
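
The datasets listed in Table 5 are all publicly available. As a minimal loading sketch (assuming a torchvision-based pipeline, which the paper does not state; the `root` paths are placeholders):

```python
# Hedged sketch: loading the public datasets named in the paper via torchvision.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR-10 / CIFAR-100: 50,000 training examples each (Table 5).
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)

# Tiny ImageNet (100,000 training examples) is not bundled with torchvision;
# after a manual download it can be read with ImageFolder:
# tiny_train = datasets.ImageFolder("./data/tiny-imagenet-200/train", transform=to_tensor)

print(len(cifar10_train), len(cifar100_train))  # 50000 50000
```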
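
The Table 6 configuration can be sketched in PyTorch/torchvision as below. The transform ordering, the reading of the color-jitter tuple as (brightness, contrast, saturation, hue), and the `resnet18` stand-in backbone are assumptions; only the numeric values come from Table 6, and mixup and the 65,536-dimensional projector are omitted.

```python
# Hedged sketch of the Table 6 hyperparameters; not the authors' released code.
import torch
from torch import nn
from torchvision import transforms
from torchvision.models import resnet18  # stand-in backbone; the paper uses ResNet/VGG variants

# Augmentations from Table 6 (ordering and jitter interpretation are assumptions).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.08, 1.0)),  # image size 32 (64 for Tiny ImageNet)
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(0.8, 0.8, 0.8, 0.2),           # brightness, contrast, saturation, hue
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# Optimization settings from Table 6: Adam, lr 4e-3, no scheduler, no weight decay, MSE loss.
model = resnet18(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-3,
                             betas=(0.9, 0.999), weight_decay=0.0)
criterion = nn.MSELoss()
batch_size = 256  # no clip-norm, no dropout
# Mixup ("yes" in Table 6) and the 65,536-dimensional projector head are not shown here.
```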