Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Better Estimation of the Kullback--Leibler Divergence Between Language Models

Authors: Afra Amini, Tim Vieira, Ryan Cotterell

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient. ... We empirically validate our theoretical findings using the sentiment-controlled generation task [29] as a testbed. ... Our experimental results confirm that our proposed estimator significantly reduces the variance of the Monte Carlo estimator, yielding the most stable and reliable estimates among all methods studied.
Researcher Affiliation	Academia	Afra Amini Tim Vieira Ryan Cotterell ETH Zürich EMAIL EMAIL
Pseudocode	Yes	We provide a code snippet for implementing the RB estimator in App. F.1. def compute_kl(logprobs, ref_logprobs, logits, ref_logits): Compute KL divergence using two estimators. logprobs: Log probabilities of sampled actions from policy; ref_logprobs: Log probabilities of same actions from reference; logits: Full distribution logits from policy (all actions); ref_logits: Full distribution logits from reference (all actions). Example: # Policy samples action 3 from 1000 possible actions; logprobs = [-2.3] # Only action 3's log prob; logits = [0.1, -0.5, ...] # All 1000 action logits (raw); # MC: uses only sampled action; # RB: uses all 1000 actions for exact expectation. # Monte Carlo: unbiased but higher variance; kl_mc = mean(logprobs - ref_logprobs); # Rao-Blackwell: lower variance, uses full distribution; log_p = log_softmax(logits) # Normalize to log probs; log_q = log_softmax(ref_logits) # Normalize to log probs; kl_rb = mean(sum(exp(log_p) * (log_p - log_q))); return kl_mc, kl_rb
Open Source Code	Yes	https://github.com/rycolab/kl-rb
Open Datasets	Yes	Concretely, the reference model, denoted as q, is the GPT-IMDB model, i.e., a GPT-2 [28] model fine-tuned on IMDB corpus [23]. ... [23] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Dataset Splits	Yes	To obtain pθ, we fine-tune q with direct preference optimization [DPO; 29] on a sample of 5,000 data points from the IMDB training set. ... We evaluate on 512 examples from the IMDB dataset.
Hardware Specification	Yes	Each experiment takes approximately 20 minutes on a single RTX 4090 GPU.
Software Dependencies	No	The paper mentions using GPT-2 models, fine-tuning with DPO, and a specific RL algorithm, but does not provide explicit version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup	Yes	In App. F.2, we include the hyperparameters used with the RLOO algorithm for the sentiment control experiment. ... Hyperparameter: Value. Optimizer: AdamW (ϵ = 1e-5, lr = 3e-6); Scheduler: Linear; Batch Size: 32; β: 0.07; k: 2; Number of RLOO Updates Iteration Per Epoch: 4; Clip range: 0.2; Sampling Temperature: 1
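
The snippet quoted in the Pseudocode row contrasts a Monte Carlo (MC) and a Rao-Blackwellized (RB) estimate of the KL divergence. The following is a minimal, standard-library-only sketch of that contrast, not the authors' implementation (their repository presumably operates on tensors). The function name and argument names follow the quoted snippet; the pure-Python `log_softmax` helper, the four-token vocabulary, and all numeric values are assumptions made for illustration.

```python
import math
import random


def log_softmax(logits):
    """Normalize raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def compute_kl(logprobs, ref_logprobs, logits, ref_logits):
    """Return (kl_mc, kl_rb), each averaged over the sampled positions.

    logprobs / ref_logprobs: log-prob of each sampled token under the
        policy / reference model (one float per position).
    logits / ref_logits: raw logits over the full vocabulary at each
        position (one list of floats per position).
    """
    n = len(logprobs)

    # Monte Carlo: uses only the sampled token at each position.
    kl_mc = sum(lp - lq for lp, lq in zip(logprobs, ref_logprobs)) / n

    # Rao-Blackwellized: exact expectation over the vocabulary per position.
    kl_rb = 0.0
    for pos_logits, pos_ref_logits in zip(logits, ref_logits):
        log_p = log_softmax(pos_logits)
        log_q = log_softmax(pos_ref_logits)
        kl_rb += sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))
    return kl_mc, kl_rb / n


# Hypothetical toy setup: one fixed next-token distribution, sampled repeatedly.
rng = random.Random(0)
policy_logits = [0.1, -0.5, 1.2, 0.3]
ref_logits_toy = [0.0, 0.0, 0.0, 0.0]  # uniform reference

log_p = log_softmax(policy_logits)
log_q = log_softmax(ref_logits_toy)
probs = [math.exp(x) for x in log_p]
samples = rng.choices(range(len(probs)), weights=probs, k=2000)

kl_mc, kl_rb = compute_kl(
    [log_p[t] for t in samples],
    [log_q[t] for t in samples],
    [policy_logits] * len(samples),
    [ref_logits_toy] * len(samples),
)
exact_kl = sum(p * (a - b) for p, a, b in zip(probs, log_p, log_q))
print(f"MC: {kl_mc:.4f}  RB: {kl_rb:.4f}  exact: {exact_kl:.4f}")
```

In this single-position toy the RB value coincides with the exact KL because the conditioning context is fixed, while the MC value fluctuates with the drawn samples. With full sequences, the RB estimate would still vary with the sampled prefixes, so the toy overstates the variance reduction relative to the setting the paper studies.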