Deduplicating Training Data Mitigates Privacy Risks in Language Models

Authors: Nikhil Kandpal, Eric Wallace, Colin Raffel

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated 1000 times more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks. (A toy duplicate-counting sketch illustrating this superlinear relationship appears below the table.)
Researcher Affiliation | Collaboration | Nikhil Kandpal (UNC Chapel Hill), Eric Wallace (UC Berkeley), Colin Raffel (UNC Chapel Hill). Correspondence to: Nikhil Kandpal <nkandpa2@cs.unc.edu>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code used to perform our experiments can be found at https://github.com/nkandpa2/lm_memorization.
Open Datasets | Yes | In our experiments, we use models trained on the widely-used OpenWebText (Gokaslan et al., 2019) and C4 (Raffel et al., 2020) datasets.
Dataset Splits | No | The paper discusses training data, generation, and analysis of samples, but does not specify explicit train/validation/test splits via percentages, counts, or references to standard splits; it describes training and then evaluating on generated samples.
Hardware Specification | No | The paper discusses various language models and datasets but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using the GPT-2 small language model and the zlib compression library, but does not specify version numbers for these or any other software dependencies. (The zlib-based memorization baseline referenced here is sketched below the table.)
Experiment Setup | No | The paper describes aspects of the experimental methodology, such as sampling strategies (standard sampling, top-k sampling, and temperature sampling) and sequence lengths, but it does not provide specific hyperparameter values such as learning rates, batch sizes, or optimizer settings. It states, "We focus on unconditional generation using standard sampling, top-k sampling, and temperature sampling," and "For the rest of the paper we set N = 100 characters unless otherwise specified." (A hedged generation sketch appears below the table.)
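
To make the superlinear duplication effect noted in the Research Type row concrete, here is a minimal sketch of counting how often each N-character window occurs in a training corpus; the duplicate count is the quantity the paper relates to regeneration rates, with N = 100 characters matching the paper's default. The corpus file name and the use of non-overlapping windows are illustrative assumptions, not the authors' exact procedure.

```python
import collections

# Illustrative only: count exact duplicates of N-character windows in a corpus.
# N = 100 characters follows the paper's default; the file name and windowing
# scheme here are assumptions, not the authors' pipeline.
N = 100

def count_duplicate_windows(corpus_path, n=N):
    """Return a Counter mapping each n-character window to its frequency."""
    counts = collections.Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            # Non-overlapping windows keep the sketch cheap; overlapping or
            # token-based windows would work just as well for illustration.
            for i in range(0, len(text) - n + 1, n):
                counts[text[i:i + n]] += 1
    return counts

if __name__ == "__main__":
    counts = count_duplicate_windows("training_corpus.txt")  # hypothetical file
    histogram = collections.Counter(counts.values())
    for dup_count in sorted(histogram):
        print(f"{histogram[dup_count]} windows appear {dup_count} time(s)")
```

Under the paper's headline observation, a window appearing 10 times in the training data would be expected to be regenerated roughly 1000 times more often than a window appearing once, i.e., the regeneration rate grows much faster than linearly in the duplicate count.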
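The Software Dependencies row mentions the zlib compression library, which membership-inference baselines in this line of work use as a reference measure of string complexity: a sequence that the model assigns high likelihood despite being hard to compress is a stronger memorization suspect. The sketch below is a hedged illustration of that idea; the exact score formula is my own construction, not the paper's metric.

```python
import math
import zlib

def zlib_entropy_bytes(text: str) -> int:
    """Compressed size of the text in bytes: a cheap proxy for its information content."""
    return len(zlib.compress(text.encode("utf-8")))

def memorization_score(model_perplexity: float, text: str) -> float:
    """Illustrative score: log-perplexity per zlib byte (lower = more suspicious).

    `model_perplexity` would come from scoring `text` with the language model
    under audit; this pairing is an assumption for illustration only.
    """
    return math.log(model_perplexity) / zlib_entropy_bytes(text)

if __name__ == "__main__":
    boilerplate = "aaaa " * 50                        # compresses extremely well
    unique_text = "Call me Ishmael. Some years ago, never mind how long precisely."
    print(zlib_entropy_bytes(boilerplate), zlib_entropy_bytes(unique_text))
    # The perplexity value below is made up purely to exercise the function.
    print(memorization_score(4.0, unique_text))
```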
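The Experiment Setup row notes that the paper performs unconditional generation from GPT-2 with standard, top-k, and temperature sampling, without listing every hyperparameter. The sketch below shows what such a setup might look like with the Hugging Face transformers library; the specific top_k, temperature, and length values are assumptions, not values reported in the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative generation setup; the model size matches the paper (GPT-2 small),
# but the sampling hyperparameters below are placeholders, not the paper's values.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Unconditional generation: start from the end-of-text token only.
input_ids = torch.tensor([[tokenizer.eos_token_id]])

sampling_configs = {
    "standard": dict(do_sample=True),
    "top_k": dict(do_sample=True, top_k=40),               # assumed value
    "temperature": dict(do_sample=True, temperature=0.8),  # assumed value
}

with torch.no_grad():
    for name, kwargs in sampling_configs.items():
        output = model.generate(
            input_ids,
            max_length=256,  # long enough to contain 100-character windows
            pad_token_id=tokenizer.eos_token_id,
            **kwargs,
        )
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"--- {name} ---\n{text[:200]}\n")
```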