Emergent and Predictable Memorization in Large Language Models

Authors: Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, Edward Raff

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure memorization in the Pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. We additionally provide further novel discoveries on the distribution of memorization scores across models and data. We release all code and data necessary to reproduce the results in this paper at https://github.com/EleutherAI/pythia. (An illustrative recall computation is sketched after the table.)
Researcher Affiliation | Collaboration | Stella Biderman (1,2), USVSN Sai Prashanth (2), Lintang Sutawika (2), Hailey Schoelkopf (2,3), Quentin Anthony (2,4), Shivanshu Purohit (5), and Edward Raff (1,6). Affiliations: 1 Booz Allen Hamilton; 2 EleutherAI; 3 Yale University; 4 Ohio State University; 5 Stability AI; 6 University of Maryland, Baltimore County.
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured code-like blocks.
Open Source Code | Yes | We release all code and data necessary to reproduce the results in this paper at https://github.com/EleutherAI/pythia.
Open Datasets | Yes | All of these suites were trained on the Pile [Gao et al., 2020, Biderman et al., 2022].
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test dataset splits. It discusses training on sequences and evaluating memorization.
Hardware Specification | No | The paper states 'We are grateful to Stability AI for providing the compute required to carry out our experiments' but does not specify any particular GPU models, CPU types, or other specific hardware configurations used for the experiments.
Software Dependencies | No | The paper mentions using the 'GPT-NeoX library [Andonian et al., 2021]' for evaluation, but it does not specify a version number for this library or for other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | To ensure computational feasibility in our experiments, we choose k = 32 and evaluate the first 64 tokens from each sequence (we verify the robustness of this choice in Appendix A). Each sequence is a set of 2049 tokens, sampled from shuffled documents. These sequences are the input data points to the model during training. For most of our experiments, we evaluate seven checkpoints spaced approximately evenly throughout training, corresponding to models trained on 23×10^6, 44×10^6, 65×10^6, 85×10^6, 105×10^6, 126×10^6, and 146×10^6 sequences respectively. (A hedged sketch of this measurement protocol appears after the table.)
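
The Experiment Setup row describes the measurement protocol: prompt a model with the first k = 32 tokens of a training sequence, greedily generate the next 32 tokens, and score the fraction that exactly match the true continuation. The sketch below is a minimal illustration of that protocol, not the authors' released code; the model name, the use of Hugging Face Transformers, and the `memorization_score` helper are assumptions made for this example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 32  # prompt length; the paper also greedily generates K continuation tokens

# Model choice is illustrative; the paper evaluates the Pythia suite.
model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def memorization_score(token_ids: torch.Tensor) -> float:
    """token_ids: 1-D LongTensor holding at least the first 2*K = 64 tokens of a training sequence."""
    prompt = token_ids[:K].unsqueeze(0)   # first 32 tokens are fed as the prompt
    target = token_ids[K:2 * K]           # the true next 32 tokens from the training data
    with torch.no_grad():
        output = model.generate(
            prompt,
            max_new_tokens=K,
            do_sample=False,              # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = output[0, K:2 * K]        # generated continuation (output includes the prompt)
    return (generated == target).float().mean().item()

A score of 1.0 means the model reproduces the continuation exactly, which the paper treats as the sequence being memorized.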
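
The Research Type row summarizes the paper's forecasting question: can memorization in a smaller model or an earlier checkpoint predict which sequences the final model will memorize, and how high is the recall of that prediction? The sketch below is an illustrative computation under that framing, not the authors' evaluation code: it thresholds per-sequence scores at 1.0 and reports precision and recall of the predictor run against the final model. The score arrays are hypothetical stand-ins for scores computed as above.

import numpy as np

def forecast_precision_recall(predictor_scores: np.ndarray, final_scores: np.ndarray):
    """Treat full memorization (score == 1.0) in the predictor run as a forecast
    that the final model memorizes the same sequence."""
    predicted = predictor_scores == 1.0   # sequences the predictor run fully memorizes
    actual = final_scores == 1.0          # sequences the final model fully memorizes
    true_positives = np.sum(predicted & actual)
    precision = true_positives / max(predicted.sum(), 1)
    recall = true_positives / max(actual.sum(), 1)
    return precision, recall

# Hypothetical per-sequence scores for five sequences:
precision, recall = forecast_precision_recall(
    np.array([1.0, 0.5, 1.0, 0.0, 1.0]),   # smaller model / earlier checkpoint
    np.array([1.0, 1.0, 1.0, 0.0, 0.0]),   # final model
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.67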