Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Antidistillation Sampling

Authors: Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, Zico Kolter

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through a range of experiments, we demonstrate the effectiveness of antidistillation sampling and discuss several interesting phenomena.
Researcher Affiliation	Academia	Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter; Carnegie Mellon University
Pseudocode	Yes	Algorithm 1: Antidistillation sampling. Input: Prompt x_{1:n}, max tokens N, penalty multiplier λ, approximation parameter ϵ, temperature τ. 1. (Initialization) Compute the gradient g of the downstream loss. 2. For each token index t = n, n + 1, ..., N − 1: i. Compute the antidistillation penalty term Δ̂(·|x_{1:t}) ← [log p(·|x_{1:t}; θ_P + ϵg) − log p(·|x_{1:t}; θ_P − ϵg)] / (2ϵ). ii. Sample the next token x_{t+1} from the teacher's adjusted distribution: (1/τ) log p(·|x_{1:t}; θ_T) + λ Δ̂(·|x_{1:t}). Output: Sampled sequence x_{1:N}
Open Source Code	Yes	Our code is available at https://github.com/locuslab/antidistillation-sampling.
Open Datasets	Yes	We evaluate the performance of antidistillation sampling on GSM8K [9] (we use GSM8K Platinum for the test set [45]), MATH [10], and MMLU [11] benchmarks (all provided under the MIT license)...
Dataset Splits	Yes	For our experiments, we use the first 70% of our train data as the training set and the remaining 30% as the holdout set.
Hardware Specification	Yes	All of our experiments are performed on nodes with 8 NVIDIA H100 GPUs and we use the transformers package [46], the trl toolkit [47], and the accelerate library [48].
Software Dependencies	No	All of our experiments are performed on nodes with 8 NVIDIA H100 GPUs and we use the transformers package [46], the trl toolkit [47], and the accelerate library [48].
Experiment Setup	Yes	Distillation protocol. All distillation experiments use LoRA [49] with rank 128, α = 128, and dropout probability 0. Our optimization protocol employs a learning rate of 0.0005, weight decay coefficient of 0.1, and gradient clipping at norm 1.0. Training follows a cosine learning rate schedule with warm-up over the first 10% of training, batch size 32, for 4 epochs. These values are the result of a systematic hyperparameter sweep using the MATH dataset to find configurations that maximize student performance gain. ... We use a max generation length of 1024 for both GSM8K and MMLU and 2048 for MATH. For antidistillation sampling, we use a temperature of τ = 0.6; we found that sweeping between τ ∈ [0, 1] does not significantly impact antidistillation performance.
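
Illustrative sketch: The sampling loop quoted in the Pseudocode row can be sketched in a few lines of plain Python. This is not the authors' implementation (see the repository linked above for that). It assumes a hypothetical toy "model", toy_logp, that ignores the prefix and treats its parameter vector as one logit per vocabulary token. It also assumes the gradient g from the initialization step has already been computed and is passed in as grad. All function and argument names here are invented for illustration, and the temperature is applied as a 1/τ scaling of the teacher log-probabilities.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def toy_logp(prefix, theta):
    """Hypothetical stand-in for a language model: ignores the prefix and
    treats theta as one logit per vocabulary token."""
    return log_softmax(theta)


def antidistillation_sample(prompt, max_tokens, lam, eps, tau,
                            teacher_logp, proxy_logp,
                            theta_teacher, theta_proxy, grad, rng):
    """Sample tokens until the sequence (prompt included) has max_tokens entries.

    grad is the precomputed downstream-loss gradient g (step 1 of the algorithm).
    """
    seq = list(prompt)
    # Perturb the proxy parameters along +eps*g and -eps*g.
    theta_plus = [p + eps * g for p, g in zip(theta_proxy, grad)]
    theta_minus = [p - eps * g for p, g in zip(theta_proxy, grad)]
    while len(seq) < max_tokens:
        # Step 2.i: central finite-difference penalty for every candidate token.
        lp_plus = proxy_logp(seq, theta_plus)
        lp_minus = proxy_logp(seq, theta_minus)
        penalty = [(a - b) / (2 * eps) for a, b in zip(lp_plus, lp_minus)]
        # Step 2.ii: temperature-scaled teacher log-probs plus lam * penalty,
        # normalized into a distribution and sampled.
        teacher = teacher_logp(seq, theta_teacher)
        scores = [t / tau + lam * d for t, d in zip(teacher, penalty)]
        probs = [math.exp(v) for v in log_softmax(scores)]
        seq.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    sample = antidistillation_sample(
        prompt=[0, 1], max_tokens=8, lam=0.5, eps=1e-3, tau=0.6,
        teacher_logp=toy_logp, proxy_logp=toy_logp,
        theta_teacher=[0.1, 0.2, 0.3], theta_proxy=[0.0, 0.5, -0.5],
        grad=[1.0, -1.0, 0.0], rng=rng,
    )
    print(sample)
```

With lam = 0 the loop reduces to ordinary temperature sampling from the teacher; larger values of lam shift probability mass toward tokens with a larger penalty term, following the sign convention of the quoted algorithm.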