Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Authors: USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S V, Mohammad Aflah Khan, Jaydeep Borkar, Christopher Choquette-Choo, Jacob Fuehne, Stella R Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use our taxonomy in a variety of experiments that highlight the multifaceted nature of memorization. In summary: we introduce an intuitive taxonomy and heuristics for categorizing memorized data. By comparing memorized and unmemorized distributions, we assess how a variety of corpus-wide statistics, datum-level metrics, and representational differences influence the likelihood of a given sequence being memorized. Our dependency tests confirm existing findings that low perplexity is strongly associated with memorization, though not equally for all memorized examples; this fact guides our heuristic for partitioning memorized data into a recitation category. We study scaling factors in memorization by monitoring each taxonomic category over the course of training and across model sizes. To demonstrate the value of our taxonomy, we train logistic regressions to predict the likelihood of memorization for candidate sequences from each memorization category.
Researcher Affiliation | Collaboration | 1 EleutherAI, 2 Microsoft, 3 New York University, 4 DatologyAI, 5 Northeastern University, 6 MPI-SWS, 7 IIIT Delhi, 8 Google DeepMind, 9 University of Illinois at Urbana-Champaign, 10 Harvard University, 11 Kempner Institute
Pseudocode | No | The paper describes the steps for checking incrementing and repeating templates in Appendix A.4 (e.g., 'To check for an incrementing sequence, we perform the following steps:', 'We perform the following steps to check for repeating sequences:'), but these are described in natural-language bullet points, not as structured pseudocode or algorithm blocks with formal syntax.
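The appendix steps themselves are not reproduced in this report. Purely to illustrate what such template checks can look like, here is a stdlib sketch; the function names and exact criteria are assumptions for illustration, not the paper's procedure from Appendix A.4:

```python
import re

def is_incrementing(text: str) -> bool:
    """Heuristic: do the integers embedded in the string form a sequence
    that increases by a constant positive step (a common memorized
    template, e.g. numbered lists or page counters)? Illustrative only."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    if len(nums) < 3:
        return False
    diffs = {b - a for a, b in zip(nums, nums[1:])}
    return len(diffs) == 1 and next(iter(diffs)) > 0

def is_repeating(tokens: list, min_repeats: int = 3) -> bool:
    """Heuristic: is the token sequence a short template repeated end to
    end at least `min_repeats` times, e.g. ['a','b','a','b','a','b']?"""
    n = len(tokens)
    for period in range(1, n // min_repeats + 1):
        if n % period == 0 and tokens == tokens[:period] * (n // period):
            return True
    return False
```

For example, `is_incrementing("page 1 page 2 page 3")` holds while `is_incrementing("3 1 4 1 5")` does not, and `is_repeating(["a", "b"] * 3)` holds while `is_repeating(["a", "b", "c"])` does not.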
Open Source Code | No | The paper does not contain any explicit statement about releasing their own source code for the methodology described, nor does it provide a direct link to a code repository. It mentions using 'deduplicated Pythia models (Biderman et al., 2023b)' and 'DistilBERT (Sanh et al., 2020)', which refers to third-party tools or models, not their own implementation code.
Open Datasets | Yes | Our memorized sample is a public list of sequences memorized by Pythia, released by Biderman et al. (2023a). ... this dataset contains all 32-extractable samples from the Pile, verified by referencing the training data (Gao et al., 2020).
Dataset Splits | Yes | We split the representative sample into test and train sets. We then combine the train set with the full memorized sample, reserving a portion as a validation set.
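The quoted splitting scheme (representative sample split into train/test, then the train portion combined with the memorized sample and a validation slice held back) can be sketched as follows; the variable names, sizes, and split fractions are illustrative assumptions, not the paper's settings:

```python
import random

def split(seq, fractions, seed=0):
    """Shuffle a list and partition it by the given fractions; whatever
    remains after the listed fractions becomes the final partition."""
    rng = random.Random(seed)
    seq = list(seq)
    rng.shuffle(seq)
    parts, start = [], 0
    for f in fractions:
        size = int(f * len(seq))
        parts.append(seq[start:start + size])
        start += size
    parts.append(seq[start:])
    return parts

# Hypothetical stand-ins for the representative and memorized samples.
representative = list(range(1000))
memorized = list(range(1000, 1400))

# Split the representative sample into train and test sets.
rep_train, rep_test = split(representative, [0.8])

# Combine with the full memorized sample, reserving a validation portion.
train, val = split(rep_train + memorized, [0.9], seed=1)
```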
Hardware Specification | No | The Acknowledgments section states: 'This work was enabled in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. We would like to thank EleutherAI and CoreWeave for providing the computing resources used in this paper.' This statement mentions the providers of computing resources but does not specify any particular hardware models (e.g., specific GPUs, CPUs, or memory).
Software Dependencies | No | The paper mentions 'DistilBERT (Sanh et al., 2020)' for a classifier and 'Huffman Coding (Huffman, 1952)' for compressibility. While it names software/algorithms, it does not provide specific version numbers for *any* of the key software components or libraries used for implementing their methodology (e.g., Python, PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup | Yes | Each model is a logistic regression trained with L2 regularization, a bias parameter, and balanced class weights. ... To train a Natural Language vs Code classifier, we fine-tune DistilBERT (Sanh et al., 2020) on uniformly random sampled BookCorpus (Zhu et al., 2015) and github-code datasets. We train it with a learning rate of 10^-7 and batch size of 256 for a total of 1000 steps and observe a validation F1 score of 0.9950 on a held-out evaluation set.
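The logistic-regression setup quoted above (L2 regularization, a bias parameter, balanced class weights) can be sketched in plain Python; the gradient-descent loop and hyperparameters below are illustrative assumptions, not the authors' implementation:

```python
import math

def train_logreg(X, y, l2=0.1, lr=0.1, epochs=2000):
    """Logistic regression by batch gradient descent with L2 regularization,
    a bias term, and 'balanced' class weights (each class weighted inversely
    to its frequency, as in the setup described above)."""
    n, d = len(X), len(X[0])
    counts = {c: y.count(c) for c in (0, 1)}
    cw = {c: n / (2 * counts[c]) for c in (0, 1)}  # balanced class weights
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = cw[yi] * (p - yi)  # class-weighted log-loss gradient
            for j in range(d):
                gw[j] += g * xi[j]
            gb += g
        # L2 penalty applies to the weights only; the bias is unregularized.
        w = [wj - lr * (gwj / n + l2 * wj / n) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0
```

Balanced class weights matter here because memorized sequences are a small minority of training data; without reweighting, the classifier could score well by always predicting the majority class.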