Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Antidistillation Sampling
Authors: Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, Zico Kolter
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a range of experiments, we demonstrate the effectiveness of antidistillation sampling and discuss several interesting phenomena. |
| Researcher Affiliation | Academia | Yash Savani Asher Trockman Zhili Feng Yixuan Even Xu Avi Schwarzschild Alexander Robey Marc Finzi J. Zico Kolter Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Antidistillation sampling. Input: Prompt $x_{1:n}$, max tokens $N$, penalty multiplier $\lambda$, approximation parameter $\epsilon$, temperature $\tau$. 1. (Initialization) Compute the gradient $g$ of the downstream loss. 2. For each token index $t = n, n+1, \ldots, N-1$: i. Compute the antidistillation penalty term $\widehat{\Delta}(\cdot \mid x_{1:t}) \leftarrow \frac{\log p(\cdot \mid x_{1:t};\, \theta_P + \epsilon g) - \log p(\cdot \mid x_{1:t};\, \theta_P - \epsilon g)}{2\epsilon}$. ii. Sample the next token $x_{t+1}$ from the teacher's adjusted distribution $\frac{1}{\tau} \log p(\cdot \mid x_{1:t};\, \theta_T) + \lambda \widehat{\Delta}(\cdot \mid x_{1:t})$. Output: Sampled sequence $x_{1:N}$. |
| Open Source Code | Yes | Our code is available at https://github.com/locuslab/antidistillation-sampling. |
| Open Datasets | Yes | We evaluate the performance of antidistillation sampling on GSM8K [9] (we use GSM8K Platinum for the test set [45]), MATH [10], and MMLU [11] benchmarks (all provided under the MIT license)... |
| Dataset Splits | Yes | For our experiments, we use the first 70% of our train data as the training set and the remaining 30% as the holdout set. |
| Hardware Specification | Yes | All of our experiments are performed on nodes with 8 NVIDIA H100 GPUs and we use the transformers package [46], the trl toolkit [47], and the accelerate library [48]. |
| Software Dependencies | No | All of our experiments are performed on nodes with 8 NVIDIA H100 GPUs and we use the transformers package [46], the trl toolkit [47], and the accelerate library [48]. |
| Experiment Setup | Yes | Distillation protocol. All distillation experiments use LoRA [49] with rank 128, α = 128, and dropout probability 0. Our optimization protocol employs a learning rate of 0.0005, weight decay coefficient of 0.1, and gradient clipping at norm 1.0. Training follows a cosine learning rate schedule with warm-up over the first 10% of training, batch size 32, for 4 epochs. These values are the result of a systematic hyperparameter sweep using the MATH dataset to find configurations that maximize student performance gain. ... We use a max generation length of 1024 for both GSM8K and MMLU and 2048 for MATH. For antidistillation sampling, we use a temperature of τ = 0.6; we found that sweeping between τ ∈ [0, 1] does not significantly impact antidistillation performance. |
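
The sampling loop quoted in the Pseudocode row can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked in the table for that). The names `bigram_log_probs` and `antidistillation_sample` are hypothetical, the "models" are toy bigram tables rather than language models, and the downstream-loss gradient is assumed to be supplied by the caller. The sketch also assumes that the temperature divides the teacher log-probabilities and that the adjusted scores are renormalized before sampling.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def bigram_log_probs(theta, context):
    """Toy stand-in for log p(. | x_{1:t}; theta) (an assumption of this sketch):
    a bigram model whose next-token logits are the row of theta indexed by the
    last token of the (non-empty) context."""
    return log_softmax(theta[context[-1]])


def antidistillation_sample(prompt, max_tokens, theta_teacher, theta_proxy, grad,
                            lam, eps, tau, rng, log_probs=bigram_log_probs):
    """Sample tokens until the sequence has max_tokens entries.

    grad is the precomputed gradient of the downstream loss with respect to
    the proxy parameters (step 1 of the algorithm, supplied by the caller).
    """
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # Step i: central finite difference of the proxy log-probabilities
        # along the gradient direction.
        plus = log_probs(theta_proxy + eps * grad, tokens)
        minus = log_probs(theta_proxy - eps * grad, tokens)
        penalty = (plus - minus) / (2.0 * eps)
        # Step ii: temperature-scaled teacher log-probabilities plus the
        # weighted penalty, renormalized into a distribution (assumption).
        adjusted = log_softmax(log_probs(theta_teacher, tokens) / tau + lam * penalty)
        next_token = rng.choice(len(adjusted), p=np.exp(adjusted))
        tokens.append(int(next_token))
    return tokens
```

With `lam=0.0` the penalty has no effect and the loop reduces to ordinary temperature sampling from the teacher, which is a convenient sanity check. Each generated token costs two extra proxy forward passes, one at `theta_proxy + eps * grad` and one at `theta_proxy - eps * grad`.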