Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Better Estimation of the Kullback--Leibler Divergence Between Language Models
Authors: Afra Amini, Tim Vieira, Ryan Cotterell
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient. ... We empirically validate our theoretical findings using the sentiment-controlled generation task [29] as a testbed. ... Our experimental results confirm that our proposed estimator significantly reduces the variance of the Monte Carlo estimator, yielding the most stable and reliable estimates among all methods studied. |
| Researcher Affiliation | Academia | Afra Amini Tim Vieira Ryan Cotterell ETH Zürich EMAIL EMAIL |
| Pseudocode | Yes | We provide a code snippet for implementing the RB estimator in App. F.1. `def compute_kl(logprobs, ref_logprobs, logits, ref_logits):` Compute KL divergence using two estimators. Arguments: `logprobs`: log probabilities of sampled actions from policy; `ref_logprobs`: log probabilities of same actions from reference; `logits`: full distribution logits from policy (all actions); `ref_logits`: full distribution logits from reference (all actions). Example: the policy samples action 3 from 1000 possible actions; `logprobs = [-2.3]` (only action 3's log prob); `logits = [0.1, -0.5, ...]` (all 1000 action logits, raw). MC uses only the sampled action; RB uses all 1000 actions for exact expectation. Monte Carlo (unbiased but higher variance): `kl_mc = mean(logprobs - ref_logprobs)`. Rao-Blackwell (lower variance, uses full distribution): `log_p = log_softmax(logits)` (normalize to log probs); `log_q = log_softmax(ref_logits)` (normalize to log probs); `kl_rb = mean(sum(exp(log_p) * (log_p - log_q)))`; `return kl_mc, kl_rb` |
| Open Source Code | Yes | https://github.com/rycolab/kl-rb |
| Open Datasets | Yes | Concretely, the reference model, denoted as q, is the GPT-IMDB model, i.e., a GPT-2 [28] model fine-tuned on IMDB corpus [23]. ... [23] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. |
| Dataset Splits | Yes | To obtain pθ, we fine-tune q with direct preference optimization [DPO; 29] on a sample of 5,000 data points from the IMDB training set. ... We evaluate on 512 examples from the IMDB dataset. |
| Hardware Specification | Yes | Each experiment takes approximately 20 minutes on a single RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using GPT-2 models, fine-tuning with DPO, and a specific RL algorithm, but does not provide explicit version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | In App. F.2, we include the hyperparameters used with the RLOO algorithm for the sentiment control experiment. ... Hyperparameter: Value. Optimizer: AdamW (ϵ = 1e-5, lr = 3e-6); Scheduler: Linear; Batch Size: 32; β: 0.07; k: 2; Number of RLOO Updates Iteration Per Epoch: 4; Clip range: 0.2; Sampling Temperature: 1 |
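
The snippet quoted in the Pseudocode row contrasts a Monte Carlo (MC) and a Rao-Blackwellized (RB) estimate of the KL divergence. The following is a minimal, standard-library-only sketch of that contrast at the sequence level. It is not the authors' implementation (see the linked repository for that). The two "language models" are hypothetical bigram logit tables (`P_TABLE`, `Q_TABLE`) chosen only for illustration, and `estimate_kl` and `exact_kl` are names introduced here. The toy is small enough that the true KL can be computed exactly for comparison.

```python
import math
import random
import statistics

# Hypothetical toy "language models": row i holds next-token logits given
# previous token i. These tables are illustrative assumptions, not from the paper.
P_TABLE = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0], [-1.0, 0.5, 1.5]]  # policy p
Q_TABLE = [[0.0, 0.0, 0.0], [0.5, 0.0, -0.5], [0.0, 0.0, 0.0]]   # reference q


def log_softmax(logits):
    """Normalize raw logits to log probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def estimate_kl(p_table, q_table, num_samples, seq_len, rng):
    """Return per-sample MC and RB estimates of KL(p || q) over sequences.

    Sequences are sampled from p, starting from token 0.
    MC sums log p - log q at the sampled tokens only.
    RB sums the exact per-step KL over the full next-token distribution.
    """
    mc_vals, rb_vals = [], []
    for _ in range(num_samples):
        prev, mc, rb = 0, 0.0, 0.0
        for _ in range(seq_len):
            log_p = log_softmax(p_table[prev])
            log_q = log_softmax(q_table[prev])
            probs = [math.exp(lp) for lp in log_p]
            # Rao-Blackwellized term: expectation over all next tokens.
            rb += sum(pi * (lp - lq) for pi, lp, lq in zip(probs, log_p, log_q))
            # Monte Carlo term: log-ratio at the single sampled token.
            tok = rng.choices(range(len(probs)), weights=probs)[0]
            mc += log_p[tok] - log_q[tok]
            prev = tok
        mc_vals.append(mc)
        rb_vals.append(rb)
    return mc_vals, rb_vals


def exact_kl(p_table, q_table, seq_len):
    """Exact sequence-level KL for the bigram toy, by propagating the
    distribution over the previous token under p."""
    vocab = len(p_table)
    state = [1.0] + [0.0] * (vocab - 1)  # sequences start from token 0
    total = 0.0
    for _ in range(seq_len):
        nxt = [0.0] * vocab
        for prev, weight in enumerate(state):
            if weight == 0.0:
                continue
            log_p = log_softmax(p_table[prev])
            log_q = log_softmax(q_table[prev])
            for tok in range(vocab):
                pi = math.exp(log_p[tok])
                total += weight * pi * (log_p[tok] - log_q[tok])
                nxt[tok] += weight * pi
        state = nxt
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    mc_vals, rb_vals = estimate_kl(P_TABLE, Q_TABLE, 2000, 5, rng)
    print("exact KL:", exact_kl(P_TABLE, Q_TABLE, 5))
    print("MC mean/variance:", statistics.mean(mc_vals), statistics.variance(mc_vals))
    print("RB mean/variance:", statistics.mean(rb_vals), statistics.variance(rb_vals))
```

In this toy setting both estimators should average close to the exact value, while the RB estimates spread less, because the randomness of the token drawn at each step is integrated out and only the randomness of the sampled prefix remains. With real models, the per-step sum runs over the full vocabulary, which is why the quoted snippet takes full `logits` and `ref_logits` in addition to the sampled log probabilities.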