Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions
Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. |
| Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI. {jhayase,alisaliu}@cs.washington.edu |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and detailed inferences available at https://github.com/alisawuffles/tokenizer-attack. |
| Open Datasets | Yes | Natural Language Mixtures We use the OSCAR v23.01 corpus [1]... Programming Language Mixtures We use the GitHub split of RedPajama [22]... Domain Mixtures We consider...data from the RedPajama dataset: Wikipedia,... Web,... Books from the Gutenberg Project and Books3 of The Pile, Code from GitHub, and Academic, which contains LaTeX files of scientific papers on arXiv. |
| Dataset Splits | No | The paper describes how data is sampled for tokenizer training and merge-frequency estimation, but does not specify explicit training, validation, and test splits for the linear programming attack, which is instead evaluated in controlled experiments with known ground truth. |
| Hardware Specification | No | We run all of our experiments on CPUs. For training tokenizers and calculating pair frequencies, we use 16–32 CPUs and a variable amount of memory (ranging from 4 GB to 64 GB) depending on the data. |
| Software Dependencies | No | We train tokenizers using the Hugging Face tokenizers library... To solve our linear programs, we use Gurobi [32]. (An illustrative LP sketch appears below the table.) |
| Experiment Setup | Yes | We train tokenizers using the Hugging Face tokenizers library with a maximum vocabulary size of 30,000, and apply a minimal set of common pretokenization operations: we split on whitespace and only allow digits to be merged with other contiguous digits. (A minimal training sketch appears below the table.) |
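
Below is a minimal sketch of the tokenizer-training setup quoted in the Experiment Setup row, using the Hugging Face `tokenizers` library. The file name `mixture_sample.txt` is a placeholder, and the specific pre-tokenizer pipeline (`WhitespaceSplit` followed by `Digits`) is an assumption based on the description above, not the authors' released configuration.

```python
# Minimal sketch: train a BPE tokenizer with vocab size 30,000,
# splitting on whitespace and keeping digit runs separate from non-digits.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),               # split on whitespace only
    pre_tokenizers.Digits(individual_digits=False), # digits merge only with contiguous digits
])
trainer = trainers.BpeTrainer(vocab_size=30_000)

# Placeholder for a sampled data mixture; the actual corpora are listed in the table above.
corpus_files = ["mixture_sample.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.save("bpe-30k.json")
```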
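The attack itself is a linear program over the tokenizer's merge list: mixture weights are constrained so that, at each merge step, the pair the tokenizer actually merged is (up to slack) the most frequent pair under the weighted mixture of per-category pair counts. The sketch below illustrates this with Gurobi, as referenced in the Software Dependencies row; the function name `infer_mixture`, the `counts_per_step` input format, and the one-slack-variable-per-step relaxation are illustrative assumptions, not the authors' released implementation (which is available at the repository linked above).

```python
# Hypothetical sketch of the merge-consistency LP, assuming per-category
# pair counts at each merge step have already been computed.
import gurobipy as gp
from gurobipy import GRB

def infer_mixture(counts_per_step, num_categories):
    """counts_per_step: list over merge steps t; each element is a dict
    {"merged": [c_1, ..., c_n], "others": [[c_1, ..., c_n], ...]} holding the
    per-category frequency of the pair merged at step t and of competing
    pairs at the same step (all assumed precomputed)."""
    m = gp.Model("mixture-inference")
    m.Params.OutputFlag = 0
    alpha = m.addVars(num_categories, lb=0.0, name="alpha")
    slack = m.addVars(len(counts_per_step), lb=0.0, name="slack")
    # Mixture weights form a probability distribution over categories.
    m.addConstr(gp.quicksum(alpha[i] for i in range(num_categories)) == 1.0)
    for t, step in enumerate(counts_per_step):
        merged = step["merged"]
        for other in step["others"]:
            # Relaxed constraint: the chosen merge should be at least as
            # frequent under the mixture as any competing pair, up to slack.
            m.addConstr(
                gp.quicksum(alpha[i] * (merged[i] - other[i])
                            for i in range(num_categories)) >= -slack[t]
            )
    # Minimizing total slack yields weights that best explain the merge order.
    m.setObjective(gp.quicksum(slack[t] for t in range(len(counts_per_step))),
                   GRB.MINIMIZE)
    m.optimize()
    return [alpha[i].X for i in range(num_categories)]
```

The returned `alpha` values are the inferred mixture proportions; in the controlled experiments described in the first row, these are compared against the known ground-truth ratios.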