Emergent and Predictable Memorization in Large Language Models

Authors: Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, Edward Raff

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure memorization in the Pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. We additionally provide further novel discoveries on the distribution of memorization scores across models and data. We release all code and data necessary to reproduce the results in this paper at https://github.com/EleutherAI/pythia. (An illustrative recall computation is sketched after the table.)
Researcher Affiliation | Collaboration | Stella Biderman (1,2), USVSN Sai Prashanth (2), Lintang Sutawika (2), Hailey Schoelkopf (2,3), Quentin Anthony (2,4), Shivanshu Purohit (5), and Edward Raff (1,6). Affiliations: 1 Booz Allen Hamilton; 2 EleutherAI; 3 Yale University; 4 Ohio State University; 5 Stability AI; 6 University of Maryland, Baltimore County.
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured code-like blocks.
Open Source Code | Yes | We release all code and data necessary to reproduce the results in this paper at https://github.com/EleutherAI/pythia.
Open Datasets | Yes | All of these suites were trained on the Pile [Gao et al., 2020, Biderman et al., 2022].
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test dataset splits. It discusses training on sequences and evaluating memorization.
Hardware Specification | No | The paper states 'We are grateful to Stability AI for providing the compute required to carry out our experiments' but does not specify any particular GPU models, CPU types, or other specific hardware configurations used for the experiments.
Software Dependencies | No | The paper mentions using the 'GPT-NeoX library [Andonian et al., 2021]' for evaluation, but it does not specify a version number for this library or for other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | To ensure computational feasibility in our experiments, we choose k = 32 and evaluate the first 64 tokens from each sequence (we verify the robustness of this choice in Appendix A). Each sequence is a set of 2049 tokens, sampled from shuffled documents. These sequences are the input data points to the model during training. For most of our experiments, we evaluate seven checkpoints spaced approximately evenly throughout training, corresponding to models trained on 23×10^6, 44×10^6, 65×10^6, 85×10^6, 105×10^6, 126×10^6, and 146×10^6 sequences respectively. (A hedged sketch of this measurement protocol appears after the table.)
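
The Experiment Setup row describes the measurement protocol: prompt a model with the first k = 32 tokens of a training sequence, greedily generate the next 32 tokens, and score the fraction that exactly match the true continuation. The sketch below is a minimal illustration of that protocol, not the authors' released code; the model name, the use of Hugging Face Transformers, and the `memorization_score` helper are assumptions made for this example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 32  # prompt length; the paper also greedily generates K continuation tokens

# Model choice is illustrative; the paper evaluates the Pythia suite.
model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def memorization_score(token_ids: torch.Tensor) -> float:
    """token_ids: 1-D LongTensor holding at least the first 2*K = 64 tokens of a training sequence."""
    prompt = token_ids[:K].unsqueeze(0)   # first 32 tokens are fed as the prompt
    target = token_ids[K:2 * K]           # the true next 32 tokens from the training data
    with torch.no_grad():
        output = model.generate(
            prompt,
            max_new_tokens=K,
            do_sample=False,              # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = output[0, K:2 * K]        # generated continuation (output includes the prompt)
    return (generated == target).float().mean().item()

A score of 1.0 means the model reproduces the continuation exactly, which the paper treats as the sequence being memorized.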
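
The Research Type row summarizes the paper's forecasting question: can memorization in a smaller model or an earlier checkpoint predict which sequences the final model will memorize, and how high is the recall of that prediction? The sketch below is an illustrative computation under that framing, not the authors' evaluation code: it thresholds per-sequence scores at 1.0 and reports precision and recall of the predictor run against the final model. The score arrays are hypothetical stand-ins for scores computed as above.

import numpy as np

def forecast_precision_recall(predictor_scores: np.ndarray, final_scores: np.ndarray):
    """Treat full memorization (score == 1.0) in the predictor run as a forecast
    that the final model memorizes the same sequence."""
    predicted = predictor_scores == 1.0   # sequences the predictor run fully memorizes
    actual = final_scores == 1.0          # sequences the final model fully memorizes
    true_positives = np.sum(predicted & actual)
    precision = true_positives / max(predicted.sum(), 1)
    recall = true_positives / max(actual.sum(), 1)
    return precision, recall

# Hypothetical per-sequence scores for five sequences:
precision, recall = forecast_precision_recall(
    np.array([1.0, 0.5, 1.0, 0.0, 1.0]),   # smaller model / earlier checkpoint
    np.array([1.0, 1.0, 1.0, 0.0, 0.0]),   # final model
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.67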