Can Neural Network Memorization Be Localized?

Authors: Pratyush Maini, Michael Curtis Mozer, Hanie Sedghi, Zachary Chase Lipton, J Zico Kolter, Chiyuan Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are gradient accounting (measuring the contribution to the gradient norms from memorized and clean examples), layer rewinding (replacing specific model weights of a converged model with previous training checkpoints), and retraining (training rewound layers only on clean examples). (A minimal layer-rewinding sketch appears below this table.)
Researcher Affiliation | Collaboration | ¹School of Computer Science, Carnegie Mellon University; ²Google Research, Mountain View, CA.
Pseudocode | No | The paper describes the steps of its methods in narrative text and includes mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks/figures.
Open Source Code | Yes | Code for reproducing our experiments can be found at https://github.com/pratyushmaini/localizing-memorization.
Open Datasets | Yes | We perform experiments on three image classification datasets, CIFAR-10 (Krizhevsky, 2009), MNIST (Deng, 2012), and SVHN (Netzer et al., 2011).
Dataset Splits | No | The paper mentions training models and evaluating them on clean and noisy data subsets, as well as on a 'Test' set, but it does not explicitly describe a validation split or give specific percentages or counts for the training, validation, or test sets.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions a 'PyTorch implementation' when discussing layer grouping, but it does not specify version numbers for PyTorch or any other software dependencies required to replicate the experiments.
Experiment Setup | Yes | Training Parameters. We use the one-cycle learning rate (Smith, 2017) and train our models for 50 epochs using SGD optimizer. The peak learning rate for the cyclic scheduler is set to 0.1 at the 10th epoch, and the training batch size is 512. Unless specified, we add 10% random label noise to the dataset: that is, we flip the label of 10% examples to an incorrect class chosen at random. (A sketch of this training recipe follows the table.)
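
The layer-rewinding procedure quoted in the Research Type row lends itself to a short code illustration. The sketch below, assuming a PyTorch model and a dataloader over the noisy-labeled subset, copies selected layers of a converged model back to an earlier checkpoint and compares accuracy on the flipped-label examples. The function names, checkpoint path, and layer-name prefixes are illustrative assumptions, not taken from the authors' released code.

```python
# Hypothetical sketch of layer rewinding: swap selected layers of a converged
# model back to an earlier training checkpoint and check whether the noisy
# (label-flipped) examples are still memorized.
import copy
import torch


def rewind_layers(final_model, early_checkpoint_state, layer_prefixes):
    """Return a copy of `final_model` whose parameters matching any of the
    given name prefixes are replaced by the early-checkpoint values."""
    rewound = copy.deepcopy(final_model)
    state = rewound.state_dict()
    for name, tensor in early_checkpoint_state.items():
        if any(name.startswith(p) for p in layer_prefixes):
            state[name] = tensor.clone()
    rewound.load_state_dict(state)
    return rewound


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Plain classification accuracy over a dataloader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total


# Usage (illustrative): rewind only the first convolutional block and compare
# memorization (accuracy on the noisy-labeled subset) before and after.
# early_state = torch.load("checkpoint_epoch_5.pt")          # assumed path
# rewound = rewind_layers(model, early_state, ["layer1."])   # assumed prefix
# print(accuracy(model, noisy_loader), accuracy(rewound, noisy_loader))
```

A large accuracy drop on the noisy subset after rewinding a layer would indicate that the layer contributes to memorization; little change would mark it as redundant for memorization, in the spirit of the paper's analysis.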
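The training recipe quoted in the Experiment Setup row can likewise be sketched. The snippet below, a minimal sketch rather than the authors' implementation, wires up SGD with PyTorch's OneCycleLR so the learning rate peaks at 0.1 around the 10th of 50 epochs with batch size 512, and flips 10% of labels to a random incorrect class. The momentum value, random seed, and helper names are assumptions filled in for illustration.

```python
# Minimal sketch of the reported training recipe: SGD, one-cycle LR peaking at
# 0.1 around epoch 10 of 50, batch size 512, 10% random label noise.
# Hyperparameters not quoted in the paper (e.g. momentum) are assumptions.
import numpy as np
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR


def add_label_noise(labels, num_classes=10, noise_frac=0.10, seed=0):
    """Flip `noise_frac` of labels to a randomly chosen *incorrect* class.
    Returns the noisy labels and the indices of the flipped (noisy) subset."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels).copy()
    idx = rng.choice(len(labels), size=int(noise_frac * len(labels)), replace=False)
    for i in idx:
        wrong_classes = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(wrong_classes)
    return labels, idx


def train(model, train_loader, epochs=50, peak_lr=0.1, device="cpu"):
    """Train with SGD and a one-cycle schedule stepped once per batch."""
    optimizer = SGD(model.parameters(), lr=peak_lr, momentum=0.9)  # momentum assumed
    scheduler = OneCycleLR(
        optimizer,
        max_lr=peak_lr,
        epochs=epochs,
        steps_per_epoch=len(train_loader),
        pct_start=10 / epochs,  # learning rate peaks at the 10th epoch
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```

The dataloader is assumed to be built with `batch_size=512` over a dataset whose labels have already been passed through `add_label_noise`; the returned indices identify the noisy subset used to measure memorization.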