Quantifying Memorization Across Neural Language Models

Authors: Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. We first construct a set of prompts from the model's training set. By feeding prefixes of these prompts into the trained model, we check whether the model has the ability to complete the rest of the example verbatim. This allows us to measure memorization across models, datasets, and prompts of varying sizes." (A minimal sketch of this prefix-completion check appears below the table.)
Researcher Affiliation | Collaboration | Google Research, University of Pennsylvania, Cornell University
Pseudocode | No | The paper describes methods textually (e.g., in Section 3.2 and Appendix A) but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper references existing open-source models and their repositories (e.g., GPT-Neo, GPT-J-6B) but does not state that the code for the authors' specific analysis or methodology is open-source or provided.
Open Datasets | Yes | "We primarily study the GPT-Neo model family (Black et al., 2021; Wang and Komatsuzaki, 2021) trained on the Pile dataset (Gao et al., 2020)." (Gao et al., "The Pile: An 800GB dataset of diverse text for language modeling," arXiv preprint arXiv:2101.00027, 2020.) "The T5 v1.1 models... trained on C4," an 806 GB curated version of English web pages from the Common Crawl. (Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research 21, 2020.)
Dataset Splits | No | The paper uses pre-trained models and samples data for evaluation. It does not define specific training, validation, or test splits for reproducing model training from scratch. It states "We query on a smaller subset of the training data, that still produces statistically confident estimates. In this paper we randomly choose subsets of roughly 50,000 sequences" for evaluation, but this is an evaluation subsample rather than a general validation split.
Hardware Specification | No | "For example, the largest 6 billion parameter GPT-Neo model has a throughput of roughly one 100-token generation per second on a V100 GPU." This is a general statement about throughput measurement, not a specification of the hardware used for the authors' experiments.
Software Dependencies | No | The paper refers to specific models (e.g., "GPT-Neo models", "T5 v1.1 models", "OPT family of models") and cites their original papers/repositories. However, it does not provide a list of specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, specific library versions) for the analysis code the authors implemented.
Experiment Setup | No | The paper describes the evaluation procedure (e.g., prompting with prefixes, checking for verbatim completion, greedy decoding) and data sampling methods. However, it does not provide specific hyperparameters or detailed system-level settings for their evaluation scripts (e.g., batch size for inference, memory configurations).
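
To make the evaluation procedure referenced in the table concrete, the following is a minimal sketch of the prefix-completion memorization check: tokenize a training example, prompt the model with its first tokens, decode greedily, and test whether the generated continuation reproduces the true continuation verbatim. The checkpoint name (EleutherAI/gpt-neo-125M), the 50-token prefix/suffix lengths, and the helper name is_memorized are illustrative assumptions for this sketch, not the authors' released code or exact settings.

```python
# Minimal sketch of the prefix-prompting, verbatim-completion check described above.
# Model checkpoint, prefix/suffix lengths, and helper names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125M"   # any causal LM checkpoint works for the sketch
PREFIX_LEN = 50                          # tokens fed to the model as the prompt
SUFFIX_LEN = 50                          # tokens the model must reproduce verbatim

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def is_memorized(example_text: str) -> bool:
    """Return True if greedy decoding of the first PREFIX_LEN tokens
    exactly reproduces the next SUFFIX_LEN tokens of the example."""
    ids = tokenizer(example_text, return_tensors="pt").input_ids[0]
    if len(ids) < PREFIX_LEN + SUFFIX_LEN:
        return False                     # example too short to test
    prefix = ids[:PREFIX_LEN].unsqueeze(0)
    true_suffix = ids[PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
    with torch.no_grad():
        out = model.generate(
            prefix,
            max_new_tokens=SUFFIX_LEN,
            do_sample=False,             # greedy decoding, as in the paper
            pad_token_id=tokenizer.eos_token_id,  # silences a pad-token warning
        )
    generated_suffix = out[0, PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
    return len(generated_suffix) == SUFFIX_LEN and torch.equal(generated_suffix, true_suffix)
```

Averaging is_memorized over a random subset of roughly 50,000 training sequences, as quoted in the Dataset Splits row, gives the kind of memorization-fraction estimate from which the paper's log-linear relationships are derived.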