Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Authors: Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
Researcher Affiliation	Collaboration	1 Johns Hopkins University, 2 Amazon
Pseudocode	Yes	Algorithm 1 Primal-Dual Solver with Warm Starting (Problem (2))
Open Source Code	Yes	Moreover, our algorithm is implemented in this repository and made public at https://github.com/locuslab/open-unlearning.
Open Datasets	Yes	We evaluated our unlearning methodology on two established benchmarks: TOFU and MUSE [29, 39, 12].
Dataset Splits	Yes	In the main experiments, we choose to forget the subset Forget10 and defer Forget05 and Forget01 to the Supplementary Material. The MUSE benchmark focuses on unlearning in two real-world contexts: Books and News.
Hardware Specification	Yes	For the experiments using the LLAMA 3.2 1B/3B models, we use a single A100 GPU with 40GB of memory. For all other models, we use 8 A100 80 GB GPUs within a p4de.xlarge AWS EC2 instance.
Software Dependencies	No	A paged_adamw_32bit optimizer with a learning rate of 10^-5. Using torch with precision bfloat16. The paper mentions 'torch' but does not provide a specific version number for this or any other software library.
Experiment Setup	Yes	A paged_adamw_32bit optimizer with a learning rate of 10^-5. Using torch with precision bfloat16. The rest of the fine-tuning hyperparameters are reflected in Table 4.
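
The setup quoted in the last row can be written down as a small configuration sketch. The key names below follow the keyword style of Hugging Face `transformers.TrainingArguments`; that the authors configure training through that class is an assumption, and `reported_setup` and `describe` are hypothetical names used only for illustration. The remaining hyperparameters live in the paper's Table 4 and are not reproduced here.

```python
# Sketch only: the quoted setup as keyword-style settings.
# Key names mimic Hugging Face TrainingArguments (an assumption, not confirmed
# by the quoted text); only the three values come from the row above.
reported_setup = {
    "optim": "paged_adamw_32bit",  # optimizer named in the quoted text
    "learning_rate": 1e-5,         # taken to be 10^-5 (assumption about the extracted value)
    "bf16": True,                  # "torch with precision bfloat16"
}


def describe(setup: dict) -> str:
    """Render the settings as a single line, keys in alphabetical order."""
    return ", ".join(f"{key}={value}" for key, value in sorted(setup.items()))


print(describe(reported_setup))
```

Keeping the values in a plain dictionary makes it easy to compare them against whatever configuration files the public repository actually ships, without committing to a particular library version, which the paper does not state.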