Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TokenSwap: A Lightweight Method to Disrupt Memorized Sequences in LLMs

Authors: Parjanya Prashant, Kaustubh Ponkshe, Babak Salimi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on Pythia-6.9B and Llama-3-8B show up to a 10× drop in exact memorization with negligible task degradation. Our method offers a practical, accessible solution for mitigating memorized generation in deployed LLMs. Code is available at https://github.com/parjanya20/verbatim-llm. We extensively evaluate TOKENSWAP through both controlled experiments and real-world deployments. In controlled fine-tuning experiments (Section 4.1), TOKENSWAP achieves a 50-800× reduction in verbatim generation compared to undefended models. Evaluations on commercial-grade models such as Pythia-6.9B and Llama-3-8B (Section 4.2) demonstrate reductions in verbatim generation by up to 10×, without compromising downstream task performance.
Researcher Affiliation	Academia	Parjanya Prajakta Prashant (UC San Diego), Kaustubh Ponkshe (EPFL), Babak Salimi (UC San Diego). Equal Contribution; Correspondence to Parjanya Prajakta Prashant <EMAIL>, Kaustubh Ponkshe <EMAIL>
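Pseudocode	Yes	Algorithm 1 TOKENSWAP. Require: main model $p^{\text{main}}$, auxiliary model $p^{\text{aux}}$, token subset $G$, prompt $x_{<0}$. 1: for $i = 0, 1, \ldots$ do 2: $p^{\text{main}}_i \leftarrow p^{\text{main}}(\cdot \mid x_{<i})$ {Get main model probabilities} 3: $p^{\text{aux}}_i \leftarrow p^{\text{aux}}(\cdot \mid x_{<i})$ {Get auxiliary model probabilities} 4: $\alpha \leftarrow \sum_{v \in G} p^{\text{main}}_i[v] \,/\, \sum_{v \in G} p^{\text{aux}}_i[v]$ {Compute normalization} 5: for $v \in V$ do 6: $p^{\text{final}}_i[v] \leftarrow p^{\text{main}}_i[v]$ if $v \notin G$; $\alpha \cdot p^{\text{aux}}_i[v]$ if $v \in G$ 7: end for 8: $x_i \sim p^{\text{final}}_i$ {Sample next token} 9: end for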
Open Source Code	Yes	Code is available at https://github.com/parjanya20/verbatim-llm.
Open Datasets	Yes	Pile-Memorized Dataset: For Pythia-6.9B, we evaluate on memorized sequences identified by Chang et al. [16] from the Pile dataset, consisting of 32-token prefixes and 48-token suffixes. After filtering to retain only natural language content (excluding code, URLs, etc.), we obtain 184 evaluation examples. LeetCode Dataset: For Llama-3-8B, following previous work demonstrating LeetCode problem memorization [38], we evaluate on 1,825 LeetCode problem statements [30]. These problem statements are written in natural language. We evaluate TOKENSWAP on OLMo-2-13B [49], which is trained on the open Dolma dataset [62]. For our experiments, we use the AutoMathText dataset, referred to as Math Abstracts in the tables, which aggregates mathematical content from diverse sources including arXiv, OpenWebMath, RedPajama, and AlgebraicStack. The titles in this corpus were generated using the Qwen-72B language model. Additionally, we use the WritingPrompts dataset (Fan et al., 2018), which contains user-generated stories based on provided premises from a Reddit community.
Dataset Splits	Yes	Following Abad et al. [1], we fine-tune a Llama-3.2-3B model [24] on 2,000-sequence subsets from two datasets: Math Abstracts [69] and Writing Stories [27]. We train for 50 epochs to deliberately amplify memorization beyond typical levels. For all experiments and methods, a prefix of 20 tokens is used and the next 128 tokens are greedily sampled. For both datasets, we randomly sample 2,000 training examples with a fixed seed to ensure consistent training across all models. We further sample 500 distinct points for evaluation, during which we generate sequences of 128 tokens.
Hardware Specification	Yes	Experiments were conducted using a single NVIDIA A6000 GPU, ensuring efficiency in training and inference without excessive computational overhead.
Software Dependencies	No	We implement our method in PyTorch and Hugging Face. Optimizer: AdamW with default parameters. Starting with the 500 most frequent tokens from COCA [23], we apply NLTK [43] part-of-speech filtering to retain: Core grammatical elements: determiners (DT), prepositions (IN), conjunctions (CC); Pronouns (PRP, PRP$) and modal verbs (MD); Question-related tokens: wh-words (WDT, WP, WRB); Auxiliary verbs: be, do, have.
Experiment Setup	Yes	The training and evaluation phases were configured with the following hyperparameters. The hyperparameters were taken from previous work, used as a baseline [1]: Sequence Length: 2048 tokens; Batch Size: 1; Learning Rate: $5 \times 10^{-5}$; Optimizer: AdamW with default parameters; Gradient Accumulation Steps: 1; Warmup Steps: 50
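
Illustrative sketch (not the authors' code): the probability-mixing step quoted in the Pseudocode row can be written in a few lines of plain Python. This is a minimal sketch under stated assumptions: both models share one vocabulary indexed 0..n-1, the probability vectors are toy values, and the names `tokenswap_step` and `sample_token` are hypothetical. The zero-mass fallback is also an assumption, since the quoted algorithm does not say what happens in that case. The released implementation at the repository linked above may differ.

```python
import random


def tokenswap_step(p_main, p_aux, subset):
    """Mix next-token probabilities for one decoding step.

    p_main, p_aux: lists of probabilities over the same vocabulary.
    subset: set of token ids (G) whose probabilities are taken from the
            auxiliary model and rescaled to the main model's mass on G.
    """
    main_mass = sum(p_main[v] for v in subset)
    aux_mass = sum(p_aux[v] for v in subset)
    # Assumption (not in the quoted algorithm): if the auxiliary model puts
    # no mass on G, keep the main model's distribution unchanged.
    if aux_mass == 0.0:
        return list(p_main)
    alpha = main_mass / aux_mass  # normalization factor
    return [
        alpha * p_aux[v] if v in subset else p_main[v]
        for v in range(len(p_main))
    ]


def sample_token(p_final, rng):
    """Draw one token id from the mixed distribution."""
    return rng.choices(range(len(p_final)), weights=p_final, k=1)[0]


# Toy example with a 4-token vocabulary; tokens 0 and 1 form the subset G.
p_main = [0.5, 0.2, 0.2, 0.1]
p_aux = [0.1, 0.3, 0.4, 0.2]
G = {0, 1}

p_final = tokenswap_step(p_main, p_aux, G)
# p_final is approximately [0.175, 0.525, 0.2, 0.1]
next_token = sample_token(p_final, random.Random(0))
```

Because the auxiliary probabilities on G are rescaled by alpha, the total mass on G matches the main model's, so the mixed vector still sums to 1 and tokens outside G keep their original probabilities.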