Can Neural Network Memorization Be Localized?
Authors: Pratyush Maini, Michael Curtis Mozer, Hanie Sedghi, Zachary Chase Lipton, J Zico Kolter, Chiyuan Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are gradient accounting (measuring the contribution to the gradient norms from memorized and clean examples), layer rewinding (replacing specific model weights of a converged model with previous training checkpoints), and retraining (training rewound layers only on clean examples). Illustrative sketches of gradient accounting and layer rewinding follow the table. |
| Researcher Affiliation | Collaboration | School of Computer Science, Carnegie Mellon University; Google Research, Mountain View, CA. |
| Pseudocode | No | The paper describes the steps of its methods in narrative text and includes mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks/figures. |
| Open Source Code | Yes | Code for reproducing our experiments can be found at https://github.com/pratyushmaini/localizing-memorization. |
| Open Datasets | Yes | We perform experiments on three image classification datasets, CIFAR-10 (Krizhevsky, 2009), MNIST (Deng, 2012), and SVHN (Netzer et al., 2011). |
| Dataset Splits | No | The paper mentions training models and evaluating them on clean and noisy data subsets, as well as on a 'Test' set, but it does not explicitly describe a validation split or give specific percentages or counts for the training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' when discussing layer grouping, but it does not specify version numbers for PyTorch or any other software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Training Parameters. We use the one-cycle learning rate schedule (Smith, 2017) and train our models for 50 epochs with the SGD optimizer. The peak learning rate of the cyclic scheduler is 0.1, reached at the 10th epoch, and the training batch size is 512. Unless specified otherwise, we add 10% random label noise to the dataset: that is, we flip the labels of 10% of examples to an incorrect class chosen at random. A configuration sketch follows the table. |
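
As a rough illustration of the gradient-accounting probe described in the Research Type row, the sketch below compares, per parameter tensor, the gradient norm contributed by a batch of clean examples against a batch of noisy-label (memorized) examples. This is a minimal sketch assuming a standard PyTorch classifier, not the authors' released implementation; `grad_norm_by_layer` and the clean/noisy batch variables are hypothetical names.

```python
# Minimal sketch (not the authors' code): per-layer gradient norms for a
# clean batch vs. a noisy-label batch, to see which layers' gradients are
# dominated by memorized examples.
import torch
import torch.nn.functional as F

def grad_norm_by_layer(model, x, y):
    """Return {parameter name: gradient norm} for one batch."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

# Usage (model, x_clean/y_clean, x_noisy/y_noisy are placeholders):
# clean = grad_norm_by_layer(model, x_clean, y_clean)
# noisy = grad_norm_by_layer(model, x_noisy, y_noisy)
# for name in clean:
#     frac = noisy[name] / (noisy[name] + clean[name] + 1e-12)
#     print(f"{name}: fraction of gradient norm from noisy examples = {frac:.2f}")
```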
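
Layer rewinding can be approximated with PyTorch state dicts: copy the converged model, overwrite selected parameter tensors with their values from an earlier checkpoint, and compare accuracy on the noisy (memorized) and clean subsets. The checkpoint path, layer names, `evaluate` helper, and data loaders below are placeholders, not details taken from the paper.

```python
# Minimal sketch of layer rewinding (assumptions: PyTorch state dicts saved
# at several epochs; paths, layer names, and evaluation helpers are placeholders).
import copy
import torch

def rewind_layers(final_model, early_state_dict, layer_names):
    """Copy final_model and overwrite the named parameters with earlier weights."""
    rewound = copy.deepcopy(final_model)
    state = rewound.state_dict()
    for name in layer_names:
        state[name] = early_state_dict[name].clone()
    rewound.load_state_dict(state)
    return rewound

# Usage sketch:
# early_state = torch.load("ckpt_epoch05.pt")                     # hypothetical path
# rewound = rewind_layers(final_model, early_state, ["layer3.0.conv1.weight"])
# acc_noisy = evaluate(rewound, noisy_loader)                     # memorized examples
# acc_clean = evaluate(rewound, clean_loader)
```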
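
The reported training parameters map onto standard PyTorch components roughly as sketched below. Only the quantities quoted above (50 epochs, SGD, peak learning rate 0.1 at epoch 10, batch size 512, 10% random label noise) come from the paper; momentum, weight decay, and augmentation are unspecified in the excerpt and are omitted, and `add_label_noise` / `make_optimizer_and_scheduler` are illustrative names.

```python
# Minimal sketch (not the authors' code) of the reported setup: SGD, a
# one-cycle LR schedule peaking at 0.1 around epoch 10 of 50, batch size 512,
# and 10% random label noise. Unspecified hyperparameters are left out.
import numpy as np
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR

def add_label_noise(labels, noise_frac=0.1, num_classes=10, seed=0):
    """Flip the labels of a random `noise_frac` of examples to a wrong class."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels)
    idx = rng.choice(len(noisy), size=int(noise_frac * len(noisy)), replace=False)
    for i in idx:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
    return noisy, idx  # idx marks the noisy (to-be-memorized) examples

def make_optimizer_and_scheduler(model, steps_per_epoch, epochs=50, peak_lr=0.1):
    optimizer = SGD(model.parameters(), lr=peak_lr)
    scheduler = OneCycleLR(optimizer, max_lr=peak_lr, epochs=epochs,
                           steps_per_epoch=steps_per_epoch,
                           pct_start=10 / epochs)  # LR peaks at epoch 10
    return optimizer, scheduler

# Usage (a model and a DataLoader with batch_size=512 are assumed to exist):
# optimizer, scheduler = make_optimizer_and_scheduler(model, len(train_loader))
```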