Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Variational Best-of-N Alignment
Authors: Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures. |
| Researcher Affiliation | Academia | Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell; ETH Zürich |
| Pseudocode | Yes | Algorithm 1: The vBoN algorithm |
| Open Source Code | Yes | https://github.com/rycolab/vbon |
| Open Datasets | Yes | The reference model, $\pi_{\text{ref}}$, is GPT-IMDB, a GPT-2 (Radford et al., 2019) model fine-tuned on IMDB corpus (Maas et al., 2011). We use a binary sentiment classifier, denoted as $p$, with two classes $\{\text{POS}, \text{NEG}\}$ as the reward model, and define $r(y) \overset{\text{def}}{=} p(\text{POS} \mid y)$. Following Rafailov et al. (2023), we sample 5000 movie reviews from the training set of IMDB dataset and for each sample, we randomly choose a prefix length from $\{2, \ldots, 8\}$ and take that prefix as the prompt. |
| Dataset Splits | Yes | We sample 5000 movie reviews from the training set of IMDB dataset and for each sample, we randomly choose a prefix length from $\{2, \ldots, 8\}$ and take that prefix as the prompt. We further generate 512 prompts in the same way from the test set of IMDB that we use to evaluate our models. |
| Hardware Specification | Yes | Figure 4: The average reward and win rate of the aligned models improve as we increase the sample size M used for approximating the vBoN loss function. Performance on a single A100-40GB GPU. |
| Software Dependencies | No | We use the default hyperparameters in trlx library (Havrilla et al., 2023) for fine-tuning with PPO. We implement and compare the following existing methods for language model alignment: BoN-SFT: Perhaps the most straightforward way to approximate BoN distribution is to fine-tune the model to maximize the likelihood of the samples taken with BoN algorithm. Unfortunately, we find that SFT is incapable of achieving a good trade-off between achieving high rewards and low KL divergence, see App. H (Fig. 7) for the experimental results. PPO: We use PPO to optimize the KL-constrained objective in Eq. (1). |
| Experiment Setup | Yes | Hyperparameter: Value. Episodes: 10000; Optimizer: AdamW (ϵ = 1e-5, lr = 3e-6); Scheduler: Linear; Batch Size: 32; β (both for vBoN and KL-constrained RL objective): 0.05; γ (Discount Factor): 1; λ (for GAE): 0.95; Number of PPO Update Iteration Per Epoch: 4; PPO's Policy Clipping Coefficient: 0.2; Value Clipping Coefficient: 0.2; Value Function Coefficient: 0.2; Value Function Loss Clipping: True; Sampling Temperature: 0.7 |
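
The prompt construction quoted in the Open Datasets and Dataset Splits rows (sample reviews, pick a random prefix length from {2, ..., 8}, keep that prefix as the prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `make_prompts`, its parameters, and the seed are hypothetical, and the prefix length is counted in whitespace-separated words as a stand-in, since the quoted text does not say whether the length is measured in words or tokens.

```python
import random


def make_prompts(reviews, n_prompts, min_len=2, max_len=8, seed=0):
    """Sample `n_prompts` reviews and truncate each to a random-length prefix.

    Assumption: prefix length is counted in whitespace-separated words;
    the original setup may count tokenizer tokens instead.
    """
    rng = random.Random(seed)
    # Sample without replacement; requires n_prompts <= len(reviews).
    sampled = rng.sample(reviews, n_prompts)
    prompts = []
    for review in sampled:
        words = review.split()
        # randint is inclusive on both ends, matching {2, ..., 8}.
        k = rng.randint(min_len, max_len)
        prompts.append(" ".join(words[:k]))
    return prompts


if __name__ == "__main__":
    # Toy stand-in for the IMDB training split (hypothetical data).
    toy_reviews = [
        "This movie was a pleasant surprise from start to finish honestly",
        "I expected much more from the director after his last film",
        "A slow but rewarding drama with strong performances all around",
    ]
    for prompt in make_prompts(toy_reviews, n_prompts=2):
        print(prompt)
```

Under the setup quoted above, the same routine would be called with 5000 prompts on the training split and 512 prompts on the test split.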