Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions
Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. |
| Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI. {jhayase,alisaliu}@cs.washington.edu |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and detailed inferences available at https://github.com/alisawuffles/tokenizer-attack. |
| Open Datasets | Yes | Natural Language Mixtures We use the OSCAR v23.01 corpus [1]... Programming Language Mixtures We use the GitHub split of RedPajama [22]... Domain Mixtures We consider...data from the RedPajama dataset: Wikipedia,... Web,... Books from the Gutenberg Project and Books3 of The Pile, Code from GitHub, and Academic, which contains LaTeX files of scientific papers on arXiv. |
| Dataset Splits | No | The paper describes how data is sampled for tokenizer training and merge-frequency estimation, but does not specify explicit training, validation, and test splits for the linear programming attack, which is instead evaluated in controlled experiments with known ground truth. |
| Hardware Specification | No | We run all of our experiments on CPUs. For training tokenizers and calculating pair frequencies, we use 16–32 CPUs and a variable amount of memory (ranging from 4 GB to 64 GB) depending on the data. |
| Software Dependencies | No | We train tokenizers using the Hugging Face tokenizers library... To solve our linear programs, we use Gurobi [32]. (An illustrative LP sketch appears below the table.) |
| Experiment Setup | Yes | We train tokenizers using the Hugging Face tokenizers library with a maximum vocabulary size of 30,000, and apply a minimal set of common pretokenization operations: we split on whitespace and only allow digits to be merged with other contiguous digits. (A minimal training sketch appears below the table.) |
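
Below is a minimal sketch of the tokenizer-training setup quoted in the Experiment Setup row, using the Hugging Face `tokenizers` library. The file name `mixture_sample.txt` is a placeholder, and the specific pre-tokenizer pipeline (`WhitespaceSplit` followed by `Digits`) is an assumption based on the description above, not the authors' released configuration.

```python
# Minimal sketch: train a BPE tokenizer with vocab size 30,000,
# splitting on whitespace and keeping digit runs separate from non-digits.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),               # split on whitespace only
    pre_tokenizers.Digits(individual_digits=False), # digits merge only with contiguous digits
])
trainer = trainers.BpeTrainer(vocab_size=30_000)

# Placeholder for a sampled data mixture; the actual corpora are listed in the table above.
corpus_files = ["mixture_sample.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.save("bpe-30k.json")
```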
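The attack itself is a linear program over the tokenizer's merge list: mixture weights are constrained so that, at each merge step, the pair the tokenizer actually merged is (up to slack) the most frequent pair under the weighted mixture of per-category pair counts. The sketch below illustrates this with Gurobi, as referenced in the Software Dependencies row; the function name `infer_mixture`, the `counts_per_step` input format, and the one-slack-variable-per-step relaxation are illustrative assumptions, not the authors' released implementation (which is available at the repository linked above).

```python
# Hypothetical sketch of the merge-consistency LP, assuming per-category
# pair counts at each merge step have already been computed.
import gurobipy as gp
from gurobipy import GRB

def infer_mixture(counts_per_step, num_categories):
    """counts_per_step: list over merge steps t; each element is a dict
    {"merged": [c_1, ..., c_n], "others": [[c_1, ..., c_n], ...]} holding the
    per-category frequency of the pair merged at step t and of competing
    pairs at the same step (all assumed precomputed)."""
    m = gp.Model("mixture-inference")
    m.Params.OutputFlag = 0
    alpha = m.addVars(num_categories, lb=0.0, name="alpha")
    slack = m.addVars(len(counts_per_step), lb=0.0, name="slack")
    # Mixture weights form a probability distribution over categories.
    m.addConstr(gp.quicksum(alpha[i] for i in range(num_categories)) == 1.0)
    for t, step in enumerate(counts_per_step):
        merged = step["merged"]
        for other in step["others"]:
            # Relaxed constraint: the chosen merge should be at least as
            # frequent under the mixture as any competing pair, up to slack.
            m.addConstr(
                gp.quicksum(alpha[i] * (merged[i] - other[i])
                            for i in range(num_categories)) >= -slack[t]
            )
    # Minimizing total slack yields weights that best explain the merge order.
    m.setObjective(gp.quicksum(slack[t] for t in range(len(counts_per_step))),
                   GRB.MINIMIZE)
    m.optimize()
    return [alpha[i].X for i in range(num_categories)]
```

The returned `alpha` values are the inferred mixture proportions; in the controlled experiments described in the first row, these are compared against the known ground-truth ratios.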