Deduplicating Training Data Mitigates Privacy Risks in Language Models
Authors: Nikhil Kandpal, Eric Wallace, Colin Raffel
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated 1000× more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks. |
| Researcher Affiliation | Collaboration | Nikhil Kandpal 1 Eric Wallace 2 Colin Raffel 1 1UNC Chapel Hill 2UC Berkeley. Correspondence to: Nikhil Kandpal <nkandpa2@cs.unc.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used to perform our experiments can be found at https://github.com/nkandpa2/lm_memorization. |
| Open Datasets | Yes | In our experiments, we use models trained on the widely-used OpenWebText (Gokaslan et al., 2019) and C4 (Raffel et al., 2020) datasets. |
| Dataset Splits | No | The paper discusses training data, generation, and analysis of samples but does not specify explicit train/validation/test dataset splits using percentages, counts, or references to standard splits. It mentions training models and then evaluating memorization on generated samples. |
| Hardware Specification | No | The paper discusses various language models and datasets but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using GPT-2 small language model and zlib compression library, but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper describes various aspects of the experimental methodology, such as sampling strategies (standard sampling, top-k sampling, temperature sampling) and sequence lengths, but it does not provide specific hyperparameter values like learning rates, batch sizes, or optimizer settings. It states, "We focus on unconditional generation using standard sampling, top-k sampling, and temperature sampling." It also mentions "For the rest of the paper we set N = 100 characters unless otherwise specified." |
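The mitigation the paper evaluates is training-data deduplication. The paper references existing deduplication methods rather than defining its own algorithm, so the sketch below is only a minimal illustration of the simplest variant: exact-match deduplication of sequences via hashing. The function name and example corpus are hypothetical, not from the paper.

```python
import hashlib

def dedup_exact(sequences):
    """Remove exact duplicate sequences, keeping the first occurrence.

    Hashing each sequence keeps memory proportional to the number of
    unique sequences rather than their total length.
    """
    seen = set()
    unique = []
    for seq in sequences:
        digest = hashlib.sha256(seq.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(seq)
    return unique

corpus = ["the cat sat", "the cat sat", "a dog ran", "the cat sat"]
print(dedup_exact(corpus))  # ['the cat sat', 'a dog ran']
```

Note that real web-scale pipelines typically also handle near-duplicates (e.g., via suffix arrays or MinHash), which exact hashing misses; the paper's findings apply to duplicated sequences generally.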