Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Rethinking the Role of Verbatim Memorization in LLM Privacy

Authors: Tom Sander, Bargav Jayaraman, Mark Ibrahim, Kamalika Chaudhuri, Chuan Guo

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we fine-tune language models on synthetically generated biographical information including PIIs, and try to extract them in different ways after instruction fine-tuning. We find counter to conventional wisdom that better verbatim memorization does not necessarily increase data leakage via chat. We also find that it is easier to extract information via chat from an LLM that is better able to manipulate and process knowledge even if it is smaller, and that not all attributes are equally extractable. This suggests that the relationship between privacy, memorization and language understanding of LLMs is very intricate, and that examining memorization in isolation can lead to misleading conclusions.
Researcher Affiliation | Industry | Tom Sander (Meta FAIR); Bargav Jayaraman (Oracle Labs); Mark Ibrahim (Meta FAIR); Chuan Guo (Meta FAIR); Kamalika Chaudhuri (Meta FAIR)
Pseudocode | No | The paper describes the methodology and experimental procedures in prose, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We could not open source everything yet.
Open Datasets | No | Our experiments are conducted in a controlled setting using synthetic data and may not reflect how production models are trained on privacy-sensitive data. We train on 100,000 individuals with different names. From Zhu & Li (2023), we keep synthetically generated birth dates, cities of birth, cities of work, company names, university names, and fields of study. We add unique numeric identifiers (8 digits) and unique alphabetic identifiers (8 letters in a-z + A-Z) to serve as unique PIIs.
Dataset Splits | Yes | We divide the biography data into 10 equal-sized buckets (10,000 BIOs per bucket): B1, B2, ..., B10 and create BIOs-augment, where each record in bucket Bi also has i repetitions, but with varied templates that randomize sentence order and formulations: it results in a dataset with 550,000 biographies (74M tokens). Records in bucket B1 occur only once and form the tail set, while those in B10, the most frequent set, form the head set. ... Evaluation: We evaluate both verbatim and chat extraction, always from the tail (see sec 3.2). ... Instruction-tuning uses random question answer pairs with information from the head, and memorization evaluation involves extracting information from the tail only.
Hardware Specification | Yes | We utilize our internal cluster equipped with A-100 GPUs, each with 80GB of memory.
Software Dependencies | No | We use the Open RLHF (Hu et al., 2024) library for training and instruction tuning of Llama1 (Touvron et al., 2023a), Llama-2 (Touvron et al., 2023b), and Llama-3 (Dubey et al., 2024) models.
Experiment Setup | Yes | We train for up to 20 epochs, using a batch size of 128 biographies, which corresponds to 17k tokens per batch. ... We train for 10 epochs, using a batch size of 128 question-answer pairs, which corresponds to 3,778 tokens per batch. ... We use default training parameters of the Open RLHF library: base learning rate of 5 x 10^-6 with cosine schedule, and each model's default tokenizer. ... The corresponding answers are obtained using random decoding with temperature = 1.0 and top-p = 0.9 (as greedy decoding is uncommon in chatbots).
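
The figures quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is not the authors' code: the function and variable names are hypothetical, and the contiguous assignment of individuals to buckets is an assumption made only for illustration. It shows that 10 buckets of 10,000 records, with bucket Bi repeated i times, give 550,000 biographies, and that a 74M-token corpus (treated here as a rounded figure) implies roughly 17k tokens per batch of 128 biographies.

```python
# Illustrative sketch only; names and the contiguous bucket assignment are assumptions.
NUM_INDIVIDUALS = 100_000
NUM_BUCKETS = 10
BUCKET_SIZE = NUM_INDIVIDUALS // NUM_BUCKETS  # 10,000 BIOs per bucket


def bucket_of(person_id: int) -> int:
    """Return the 1-based bucket index i for a 0-based person id."""
    return person_id // BUCKET_SIZE + 1


def repetitions(person_id: int) -> int:
    """A record in bucket B_i appears i times in the augmented dataset."""
    return bucket_of(person_id)


# Total biographies: 10,000 * (1 + 2 + ... + 10)
total_bios = sum(repetitions(p) for p in range(NUM_INDIVIDUALS))

# Tail set = bucket B1 (seen once); head set = bucket B10 (seen ten times).
tail_ids = range(0, BUCKET_SIZE)
head_ids = range(NUM_INDIVIDUALS - BUCKET_SIZE, NUM_INDIVIDUALS)

# Rough tokens per batch, treating the reported 74M tokens as approximate.
APPROX_TOTAL_TOKENS = 74_000_000
BATCH_SIZE = 128
tokens_per_batch = APPROX_TOTAL_TOKENS / total_bios * BATCH_SIZE

print(total_bios, round(tokens_per_batch))
```

Under these assumptions the per-batch token estimate lands close to the 17k reported, so the dataset size and batch figures in the two rows are mutually consistent.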
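
The Experiment Setup row mentions random decoding with temperature = 1.0 and top-p = 0.9. As a reminder of what that setting means, here is a minimal, generic sketch of nucleus (top-p) sampling in plain Python. It is not taken from the paper or from the training library; the function names `softmax`, `nucleus`, and `top_p_sample` are hypothetical, and real inference stacks may handle ties and boundary cases differently.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def nucleus(probs, top_p=0.9):
    """Return the smallest set of token ids (by descending probability)
    whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def top_p_sample(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token id from the renormalized nucleus."""
    probs = softmax(logits, temperature)
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices renormalizes weights
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy distribution over four tokens: 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
toy_kept = nucleus(softmax(toy_logits), top_p=0.9)
toy_samples = [top_p_sample(toy_logits, rng=random.Random(s)) for s in range(200)]
```

In the toy example the lowest-probability token falls outside the nucleus, so it is never sampled, while the remaining tokens are still drawn at random rather than greedily.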