Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions

Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI. {jhayase,alisaliu}@cs.washington.edu
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code and detailed inferences are available at https://github.com/alisawuffles/tokenizer-attack.
Open Datasets | Yes | Natural Language Mixtures: We use the Oscar v23.01 corpus [1]... Programming Language Mixtures: We use the GitHub split of RedPajama [22]... Domain Mixtures: We consider... data from the RedPajama dataset: Wikipedia, ... Web, ... Books from the Gutenberg Project and Books3 of The Pile, Code from GitHub, and Academic, which contains LaTeX files of scientific papers on ArXiv.
Dataset Splits | No | The paper describes how data is sampled for tokenizer training and merge-frequency estimation, but does not specify explicit training, validation, and test splits for its linear programming model, since the method is evaluated in controlled experiments with known ground truth.
Hardware Specification | No | We run all of our experiments on CPUs. For training tokenizers and calculating pair frequencies, we use 16-32 CPUs and a variable amount of memory (ranging from 4 GB to 64 GB) depending on the data.
Software Dependencies | No | We train tokenizers using the Hugging Face tokenizers library... To solve our linear programs, we use Gurobi [32]. (A hedged sketch of such a linear program follows the table.)
Experiment Setup | Yes | We train tokenizers using the Hugging Face tokenizers library with a maximum vocabulary size of 30,000, and apply a minimal set of common pretokenization operations: we split on whitespace and only allow digits to be merged with other contiguous digits. (A hedged training sketch follows the table.)
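
The experiment setup above maps naturally onto the Hugging Face `tokenizers` API. The following is a minimal illustrative sketch, not the authors' released code: `corpus.txt` is a placeholder path, and the exact pretokenizer composition is our reading of the description (whitespace splitting plus keeping digit runs separate from non-digits).

```python
# Minimal sketch of the described tokenizer training setup, assuming the
# Hugging Face `tokenizers` library. "corpus.txt" is a placeholder path;
# the pretokenizer composition is an assumption based on the paper's
# description, not the authors' released configuration.
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),               # split on whitespace
    pre_tokenizers.Digits(individual_digits=False), # digit runs stay together,
                                                    # so digits merge only with
                                                    # other contiguous digits
])

trainer = BpeTrainer(vocab_size=30_000)  # maximum vocabulary size of 30,000
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```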
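For the Gurobi dependency, the sketch below shows the general shape of a linear program one could solve with `gurobipy`, under our own simplifying assumptions rather than the paper's exact formulation: mixture weights over candidate data categories are constrained so that, for each recorded BPE merge, the pair the tokenizer actually merged is at least as frequent under the mixture as a competing pair, with nonnegative slack variables absorbing violations. The pair-count arrays are toy placeholders.

```python
# Hedged sketch of a mixture-inference LP in gurobipy (requires a Gurobi
# license). This is NOT the paper's exact formulation: the constraint set,
# scaling, and counts below are illustrative assumptions only.
import gurobipy as gp
from gurobipy import GRB

n = 3  # number of candidate data categories (e.g. languages)

# chosen[t][i]: frequency in category i of the pair merged at step t
# rival[t][i]:  frequency in category i of a competing pair at step t
chosen = [[9.0, 2.0, 1.0], [4.0, 6.0, 3.0]]
rival = [[5.0, 4.0, 2.0], [3.0, 5.0, 4.0]]

m = gp.Model("mixture-inference")
alpha = m.addVars(n, lb=0.0, ub=1.0, name="alpha")  # mixture weights
v = m.addVars(len(chosen), lb=0.0, name="v")        # slack per constraint

m.addConstr(gp.quicksum(alpha[i] for i in range(n)) == 1.0)
for t in range(len(chosen)):
    # The merged pair must beat the rival pair under the mixture,
    # up to slack v[t].
    m.addConstr(
        gp.quicksum(alpha[i] * (chosen[t][i] - rival[t][i]) for i in range(n))
        + v[t] >= 0.0
    )

m.setObjective(v.sum(), GRB.MINIMIZE)  # minimize total constraint violation
m.optimize()
print([alpha[i].X for i in range(n)])  # inferred mixture ratios
```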