Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Distillation Robustifies Unlearning

Authors: Bruce W Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Turner

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmark UNDO against a variety of unlearning methods on language and arithmetic tasks, extending the Pareto frontier of retain performance versus forget unlearning robustness. We also show competitive performance and robustification of unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark.
Researcher Affiliation Academia Bruce W. Lee (1,2), Addie Foote (1), Alex Infanger (1), Leni Shor (1,3), Harish Kamath (1), Jacob Goldman-Wetzler (1,4), Bryce Woodworth (1), Alex Cloud, Alexander Matt Turner. Affiliations: (1) ML Alignment & Theory Scholars; (2) University of Pennsylvania; (3) Massachusetts Institute of Technology; (4) Brown University.
Pseudocode Yes The pseudocode is as follows:

    def do_corruption(model, noise_alpha, noise_beta=0.1, seed=42):
        # Loop through all parameters and add random noise
        assert 0 <= noise_alpha <= 1
        assert 0 <= noise_beta
        for param in model.parameters():
            if param.requires_grad:
                # Initialize corruption tensor
                corruption = torch.zeros_like(param.data)
                # Generate appropriate noise based on parameter dimensionality
                if len(param.data.shape) == 2:
                    # For weight matrices (2D tensors), use Xavier init
                    noise = torch.nn.init.xavier_uniform_(
                        torch.empty_like(param.data)
                    )
                elif len(param.data.shape) == 1:
                    # For bias vectors (1D tensors), use zeros
                    noise = torch.zeros_like(param.data)
                else:
                    raise RuntimeError(
                        f"Unsupported parameter shape: {param.data.shape}"
                    )
                # Scale the noise by beta
                corruption = noise_beta * noise
                # Apply weighted combination
                param.data = (1 - noise_alpha) * param.data + noise_alpha * corruption
        # Move model to appropriate device
        model.to(device)
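The quoted pseudocode leaves `seed` unused and `device` undefined, so it does not run on its own. The following is a minimal runnable sketch of the same interpolation, not the authors' implementation. The function name `corrupt_parameters` is hypothetical, and using `seed` to drive a local random generator is an assumption about the intended behavior. The Xavier-uniform noise is drawn by hand as U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which is what `torch.nn.init.xavier_uniform_` computes for a 2D tensor with its default gain of 1.

```python
import math

import torch
from torch import nn


def corrupt_parameters(model: nn.Module, noise_alpha: float,
                       noise_beta: float = 0.1, seed: int = 42) -> nn.Module:
    """Interpolate trainable parameters toward beta-scaled noise, in place.

    Each parameter p becomes (1 - noise_alpha) * p + noise_alpha * noise_beta * noise.
    """
    assert 0 <= noise_alpha <= 1
    assert noise_beta >= 0
    # Assumption: the seed is meant to make the noise reproducible.
    generator = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for param in model.parameters():
            if not param.requires_grad:
                continue
            if param.dim() == 2:
                # Weight matrix: Xavier-uniform noise, drawn on CPU for a
                # device-independent stream, then moved to the parameter's device.
                fan_out, fan_in = param.shape
                bound = math.sqrt(6.0 / (fan_in + fan_out))
                noise = torch.rand(param.shape, generator=generator, dtype=param.dtype)
                noise = ((noise * 2 - 1) * bound).to(param.device)
            elif param.dim() == 1:
                # Bias vector: zero noise, so the bias is only shrunk.
                noise = torch.zeros_like(param)
            else:
                raise RuntimeError(f"Unsupported parameter shape: {tuple(param.shape)}")
            # Weighted combination of the original values and the scaled noise.
            param.mul_(1 - noise_alpha).add_(noise, alpha=noise_alpha * noise_beta)
    return model
```

With `noise_alpha=0` the model is left unchanged. With `noise_alpha=1` every weight matrix is replaced by `noise_beta` times Xavier noise and every bias is set to zero. The sketch leaves the model on whatever device it already occupies rather than moving it.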
Open Source Code Yes We share our code implementation publicly through GitHub.
Open Datasets Yes Our process starts from collecting datasets to facilitate our language and arithmetic unlearning experiments. For language experiments, we utilize two sources: (1) HuggingFaceFW/fineweb-2 Korean subset, providing non-English language examples, and (2) HuggingFaceFW/fineweb-edu, containing English language examples with high-quality content [59, 58]. We sample 10 million rows from each source.
Dataset Splits Yes Both language and arithmetic datasets have separate test sets. For language data, we allocated 1,000,000 tokens for validation, which consists of 500,000 tokens for the retain domain (English) and 500,000 tokens for the forget domain (Korean). For arithmetic data, we allocated 1,000,000 questions for validation, which consists of 500,000 questions for the retain domain (addition and subtraction) and 500,000 questions for the forget domain (multiplication and division).
Hardware Specification Yes We run all experiments on servers with multiple H200 or A100 GPUs.
Software Dependencies No The paper names software components (e.g., "AdamW optimizer", "Google Gemma-2-2b tokenizer", "open-unlearning framework") but does not give version numbers for any key component or library.
Experiment Setup Yes Table 2: Pretraining Hyperparameters, Table 3: Oracle Matching Hyperparameters, Table 4: Relearning Hyperparameters, Table 5: Unlearning Hyperparameters, Table 6: Distillation Hyperparameters. These tables explicitly list hyperparameters like learning rate, batch size, epochs, maximum steps, weight decay, LR schedule, and sequence length.