Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Z. Wu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our attack on several standard unlearning benchmarks, including MUSE [30], TOFU [23], and WMDP [18]. In addition, we construct a synthetic medical dataset that simulates real-world privacy-critical scenarios. Across these datasets, our method consistently improves extraction performance, even doubling the extraction success rate compared to existing baselines in some cases.
Researcher Affiliation	Academia	Xiaoyu Wu Rice University Houston, TX 77005 EMAIL Yifei Pang Carnegie Mellon University Pittsburgh, PA 15213 EMAIL Terrance Liu Carnegie Mellon University Pittsburgh, PA 15213 EMAIL Zhiwei Steven Wu Carnegie Mellon University Pittsburgh, PA 15213 EMAIL
Pseudocode	No	The paper describes the proposed method in Section 4 "Proposed Method" using mathematical equations (Eqs. 2, 3, 4, 5) and narrative text, along with a visualization in Figure 2. However, it does not include a distinct block, figure, or section explicitly labeled as "Pseudocode" or "Algorithm" containing step-by-step instructions in a code-like format.
Open Source Code	Yes	Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
Open Datasets	Yes	We evaluate unlearning methods on three datasets: the MUSE dataset [30], the TOFU dataset [23], and the WMDP dataset [18]. MUSE [30]: We use the MUSE-News dataset, which consists of BBC news articles collected after August 2023. TOFU [23]: We use the full TOFU dataset, which consists entirely of fictitious author biographies synthesized by GPT-4. WMDP [18]: We use a subset of bio-retain-corpus from WMDP, comprising a collection of PubMed papers that span various categories within general biology.
Dataset Splits	Yes	Unless otherwise noted, we set the forgetting set size to 10% of the full dataset. MUSE [30]: The dataset is split into two disjoint subsets: D_forget and D_retain, containing 0.8M and 1.6M tokens, respectively. For a k% forgetting set, we randomly select passages from D_forget until the total number of selected tokens reaches 2.4M × k%. TOFU [23]: To construct the forgetting set, we randomly sample question-answer pairs and treat the remaining data as the retaining set. WMDP [18]: We randomly sample sentences from this subset to form the forgetting set, with the remainder serving as the retaining set. Medical Dataset Experiment Details: We randomly sample 100 records as the forgetting set, with the remaining 900 records serving as the retaining set.
Hardware Specification	Yes	All experiments are conducted using two NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions using pre-trained LLMs like Llama2-7B and Phi-1.5, and fine-tuning with LoRA, but does not specify exact versions for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or specific CUDA versions used in the experimental setup.
Experiment Setup	Yes	Unless otherwise noted, we set the forgetting set size to 10% of the full dataset. For our method, the guidance scale w is set to 2.0 for Phi and 1.4 for Llama, and the constraint level γ is set to 10^-5 by default. We fine-tune LLaMA2-7B for 2 epochs and Phi-1.5 for 3 epochs on the full dataset, using a constant learning rate of 10^-5.
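
The Dataset Splits row describes two sampling procedures: drawing a fixed number of records as the forgetting set (100 of 1,000 for the medical dataset), and drawing passages until a token budget of 2.4M × k% is reached (MUSE). The following is a minimal sketch of both, not the authors' code. The function names, the fixed seed, and the whitespace-based token count are assumptions made for illustration; the paper's repository may split and tokenize differently.

```python
import random


def split_forget_retain(records, n_forget, seed=0):
    """Randomly pick n_forget records as the forgetting set; the rest are retained.

    Hypothetical helper: the seed and the function name are illustrative only.
    """
    rng = random.Random(seed)
    forget_idx = set(rng.sample(range(len(records)), n_forget))
    forget = [r for i, r in enumerate(records) if i in forget_idx]
    retain = [r for i, r in enumerate(records) if i not in forget_idx]
    return forget, retain


def forget_token_budget(total_tokens, k_percent):
    """Token budget for a k% forgetting set, i.e. total_tokens * k%."""
    return int(total_tokens * k_percent / 100)


def sample_until_budget(passages, budget, seed=0, count_tokens=None):
    """Select passages in random order until the token budget is reached.

    count_tokens defaults to a whitespace split, which is an assumption;
    a real run would use the model's tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda p: len(p.split())
    rng = random.Random(seed)
    order = list(passages)
    rng.shuffle(order)
    chosen, used = [], 0
    for passage in order:
        if used >= budget:
            break
        chosen.append(passage)
        used += count_tokens(passage)
    return chosen


# Medical dataset: 100 forgotten records, 900 retained.
forget, retain = split_forget_retain(list(range(1000)), n_forget=100)

# MUSE: a 10% forgetting set over 2.4M total tokens.
muse_budget = forget_token_budget(2_400_000, 10)
```

With the default 10% setting, the MUSE budget works out to 240,000 tokens, drawn from the 0.8M-token D_forget subset.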