Quantifying Memorization Across Neural Language Models
Authors: Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. We first construct a set of prompts from the model's training set. By feeding prefixes of these prompts into the trained model, we check whether the model has the ability to complete the rest of the example verbatim. This allows us to measure memorization across models, datasets, and prompts of varying sizes. (Section 4, Experiments; see the code sketch after this table.) |
| Researcher Affiliation | Collaboration | Google Research, University of Pennsylvania, Cornell University |
| Pseudocode | No | The paper describes methods textually (e.g., in Section 3.2 and Appendix A) but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references existing open-source models and their repositories (e.g., GPT-Neo, GPT-J-6B) but does not state that the code for the authors' specific analysis or methodology is open-source or provided. |
| Open Datasets | Yes | We primarily study the GPT-Neo model family (Black et al., 2021; Wang and Komatsuzaki, 2021) trained on the Pile dataset (Gao et al., 2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. The T5 v1.1 models... trained on C4, an 806 GB curated version of English web pages from the Common Crawl. Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." Journal of Machine Learning Research 21 (2020). |
| Dataset Splits | No | The paper uses pre-trained models and samples data for evaluation. It does not define specific training, validation, or test splits for reproducing model training from scratch. It states 'We query on a smaller subset of the training data, that still produces statistically confident estimates. In this paper we randomly choose subsets of roughly 50,000 sequences' for evaluation, but this is not a general validation split. |
| Hardware Specification | No | For example, the largest 6 billion parameter GPT-Neo model has a throughput of roughly one 100-token generation per second on a V100 GPU. This is a general statement about model throughput, not a specification of the hardware the authors used for their experiments. |
| Software Dependencies | No | The paper refers to specific models (e.g., 'GPT-Neo models', 'T5 v1.1 models', 'OPT family of models') and cites their original papers/repositories. However, it does not provide a list of specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, specific library versions) for the analysis code they implemented. |
| Experiment Setup | No | The paper describes the evaluation procedure (e.g., prompting with prefixes, checking for verbatim completion, greedy decoding) and data sampling methods. However, it does not provide specific hyperparameters or detailed system-level settings for their evaluation scripts (e.g., batch size for inference, memory configurations). |
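
The evaluation procedure quoted above (prompt the model with a prefix of a training sequence, decode greedily, and count memorization only when the continuation reproduces the true suffix verbatim) can be illustrated with a short sketch. This is a minimal sketch, not the authors' released code: it assumes the Hugging Face `transformers` library, uses the public `EleutherAI/gpt-neo-125M` checkpoint as a stand-in for the larger GPT-Neo models studied in the paper, and expects pre-tokenized Pile training sequences supplied by the reader. The helper names `is_memorized` and `memorization_rate` are illustrative, not from the paper.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and the
# public EleutherAI/gpt-neo-125M checkpoint (the paper evaluates GPT-Neo
# models up to 6B parameters on the Pile). `training_sequences` must be
# supplied by the reader as lists of token ids drawn from the training set.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125M"  # illustrative stand-in for the model family
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def is_memorized(token_ids, prefix_len=50, suffix_len=50):
    """Prompt with `prefix_len` training tokens and test whether greedy
    decoding reproduces the next `suffix_len` tokens verbatim."""
    prefix = torch.tensor([token_ids[:prefix_len]])
    true_suffix = token_ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        generated = model.generate(
            prefix,
            max_new_tokens=suffix_len,
            do_sample=False,  # greedy decoding, as described in the paper
            pad_token_id=tokenizer.eos_token_id,
        )
    generated_suffix = generated[0, prefix_len:prefix_len + suffix_len].tolist()
    return generated_suffix == true_suffix


def memorization_rate(training_sequences, n_samples=100, prefix_len=50):
    """Estimate the fraction of sampled training sequences that are memorized.
    The paper samples roughly 50,000 sequences; a small default is used here."""
    sample = random.sample(training_sequences, min(n_samples, len(training_sequences)))
    hits = sum(is_memorized(seq, prefix_len=prefix_len) for seq in sample)
    return hits / len(sample)
```

As in the paper's setup, a sequence counts as memorized only if the greedy continuation matches the true suffix token for token; varying `prefix_len` and the model checkpoint is what would let one reproduce the reported scaling trends across prompt lengths and model sizes.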