The Curious Case of Benign Memorization

Authors: Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction to all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that malign memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size.
Researcher Affiliation | Academia | Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann; Department of Computer Science, ETH Zürich, Switzerland; {sotirios.anagnostidis,gregor.bachmann,lorenzo.noci}@inf.ethz.ch
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We have also released the code as part of the supplementary material, including scripts on how to reproduce our results.
Open Datasets | Yes | We use the standard vision datasets CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), as well as Tiny ImageNet (Le & Yang, 2015). For more details, we refer the reader to Appendix F. Table 5 (statistics for the datasets used): CIFAR-10, 50,000 training examples; CIFAR-100, 50,000 training examples; Tiny ImageNet, 100,000 training examples. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper does not explicitly mention a separate validation split percentage or size; Table 5 only specifies train and test splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments; it only mentions the use of ResNet and VGG architectures.
Software Dependencies | No | Table 6 lists hyperparameters, but no specific software versions (e.g., Python, PyTorch, TensorFlow) are mentioned.
Experiment Setup | Yes | Table 6: Hyperparameters for the random-labels experiments. Augmentations: random cropping (scale (0.08, 1)), horizontal flip (probability 0.5), color-jittering (0.8, 0.8, 0.8, 0.2), grayscale (probability 0.2), mixup. Image size 32/64; clip-norm none; dropout none; projector size 65536; projector MLP normalization none; learning rate 4e-3; learning rate scheduler none; Adam (β1, β2) = (0.9, 0.999); batch size 256; weight decay 0.0; loss MSE. (A configuration sketch follows the table.)
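As a rough aid to reproduction, here is a minimal sketch of loading the three datasets named in the Open Datasets row. It assumes a torchvision-based pipeline, which the paper does not confirm; Tiny ImageNet has no built-in torchvision loader, so the ImageFolder path below is a hypothetical stand-in for a locally downloaded copy. The authors' released supplementary code remains the authoritative reference.

```python
# Dataset-loading sketch (assumption: torchvision; the authors' released code may differ).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR-10 and CIFAR-100: 50,000 training examples each (Table 5).
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)

# Tiny ImageNet (100,000 training examples, Table 5) has no torchvision loader;
# a hypothetical local copy laid out as class folders can be read with ImageFolder.
tiny_imagenet_train = datasets.ImageFolder(root="./data/tiny-imagenet-200/train", transform=to_tensor)

print(len(cifar10_train), len(cifar100_train), len(tiny_imagenet_train))
# Expected: 50000 50000 100000
```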
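And a hedged sketch of the Table 6 configuration, assembled only from the hyperparameters quoted above. The transform ordering, the `make_optimizer` helper, and the color-jitter application probability are assumptions (the table gives only the jitter strengths); mixup and the 65,536-wide projector are noted in the table but not implemented here.

```python
# Sketch of the Table 6 hyperparameters (assumption: a PyTorch/torchvision pipeline;
# the authors' released code is the authoritative reference).
import torch
from torchvision import transforms

IMAGE_SIZE = 32  # Table 6: image size 32 (CIFAR) / 64 (Tiny ImageNet)

# Augmentations listed in Table 6 (mixup is applied per batch and omitted here).
train_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(IMAGE_SIZE, scale=(0.08, 1.0)),  # random cropping scale (0.08, 1)
    transforms.RandomHorizontalFlip(p=0.5),                       # horizontal flip probability 0.5
    transforms.ColorJitter(0.8, 0.8, 0.8, 0.2),                   # color-jittering strengths; application
                                                                  # probability is not given in Table 6
    transforms.RandomGrayscale(p=0.2),                            # grayscale probability 0.2
    transforms.ToTensor(),
])

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Adam with the quoted learning rate and betas, zero weight decay, no scheduler."""
    return torch.optim.Adam(model.parameters(), lr=4e-3, betas=(0.9, 0.999), weight_decay=0.0)

BATCH_SIZE = 256
criterion = torch.nn.MSELoss()  # Table 6: MSE loss (presumably against one-hot label targets)
```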