Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Meta-Learning Objectives for Preference Optimization

Authors: Carlo Alfano, Silvia Sapora, Jakob Foerster, Patrick Rebeschini, Yee Whye Teh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we show that it is possible to gain insights on the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperforms existing baselines in an LLM alignment task.
Researcher Affiliation Academia Carlo Alfano (Department of Statistics, University of Oxford); Silvia Sapora (Department of Statistics, University of Oxford); Jakob N. Foerster (Department of Engineering, University of Oxford); Patrick Rebeschini (Department of Statistics, University of Oxford); Yee Whye Teh (Department of Statistics, University of Oxford)
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks. Methodologies are described through mathematical equations and textual descriptions, for example, in Section 3 'Mirror Preference Optimization'.
Open Source Code Yes To maximize computational efficiency, all our MuJoCo experiments are implemented in JAX (Bradbury et al., 2018) using the brax (Freeman et al., 2021) and evosax (Lange, 2022) libraries. We provide an implementation of our methodology here and report hyper-parameters in Appendix K.
Open Datasets Yes We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. [...] We evaluate the tuned LLMs against GPT-4, using the AlpacaEval library (Li et al., 2023) and Llama-3.1-70B-Instruct as a judge. [...] gemma-7b, dpo-mix-7k; gemma-7b, capybara-7k; mistral-7b, dpo-mix-7k [...] https://huggingface.co/datasets/argilla/dpo-mix-7k ; https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized
Dataset Splits No We then generate a preference dataset of 1280 rows, each with two trajectories of length 1000 starting from the same state. Each trajectory is generated by either the original or the target agent, depending on the current setting. A Bradley-Terry judge ranks each pair of trajectories and declares a winner, based on their true cumulative reward. We consider three variations of the preference dataset, each meant to represent a common issue of real world data. [...] For all combinations of starting LLM, dataset, and PO algorithm, we perform 4 update epochs and set the learning rate to 5e-5 and β to 0.05.
Hardware Specification Yes All the experiments were conducted on 4 NVIDIA L40S GPUs.
Software Dependencies No To maximize computational efficiency, all our MuJoCo experiments are implemented in JAX (Bradbury et al., 2018) using the brax (Freeman et al., 2021) and evosax (Lange, 2022) libraries. We provide an implementation of our methodology here and report hyper-parameters in Appendix K. [...] To tune the LLMs, we modify the Alignment Handbook library (Tunstall et al.) to include the TeMPO objective in (13).
Experiment Setup Yes All algorithms are run for 12 epochs over the preference dataset, with the exception of DPO, IPO, SimPO and R-DPO, which are run for 2 epochs after 10 epochs of SFT. We provide an additional noisy setting (ε = 0.2) and performance for other existing algorithms in Table 5 in Appendix I.1. [...] We give the hyper-parameters we use for training. The hyper-parameters specific to each algorithm are tuned for each task-data type combination. All the experiments were conducted on 4 NVIDIA L40S GPUs. Table 7: Hyper-parameter settings for PO. Number of epochs: 12; Minibatch size: 2; Learning rate: 1e-3; Max gradient norm: 1.3 [...] Table 9: Hyper-parameter settings for LLM Training. Gradient Accumulation Step: 32; Batch Size: 2; Total Batch Size: 64; LoRA: Yes; LoRA Rank: 128; LoRA Alpha: 256; LoRA Dropout: 0.05; Max length: 2048
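
Illustrative sketch (not taken from the paper's code): the Dataset Splits excerpt above describes a preference dataset of 1280 rows in which a Bradley-Terry judge compares two trajectories by their true cumulative reward. The Python below shows one way such a judge could be written, using only the standard library. Several details are assumptions rather than facts from the paper: that the winner is sampled from the Bradley-Terry probability instead of chosen deterministically, that the noisy setting's ε is the probability of flipping the label, and that returns enter the model unscaled. All function names and the toy Gaussian returns are hypothetical.

```python
import math
import random

NUM_ROWS = 1280  # dataset size quoted in the excerpt above


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (avoids overflow for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_prob(return_a: float, return_b: float) -> float:
    """Bradley-Terry probability that trajectory A is preferred over B."""
    return sigmoid(return_a - return_b)


def judge(return_a, return_b, rng, flip_prob=0.0):
    """Return 0 if A is declared the winner, 1 if B is.

    Assumption: the winner is sampled from the Bradley-Terry probability,
    and flip_prob (a stand-in for the noisy setting's epsilon) flips the label.
    """
    winner = 0 if rng.random() < bradley_terry_prob(return_a, return_b) else 1
    if rng.random() < flip_prob:
        winner = 1 - winner
    return winner


def make_preference_dataset(returns_a, returns_b, seed=0, flip_prob=0.0):
    """Build one labelled row per pair of cumulative rewards."""
    rng = random.Random(seed)
    return [
        {"return_a": ra, "return_b": rb, "winner": judge(ra, rb, rng, flip_prob)}
        for ra, rb in zip(returns_a, returns_b)
    ]


# Toy cumulative rewards standing in for real trajectory returns (hypothetical values).
_toy_rng = random.Random(0)
toy_returns_a = [_toy_rng.gauss(3000.0, 500.0) for _ in range(NUM_ROWS)]
toy_returns_b = [_toy_rng.gauss(1000.0, 500.0) for _ in range(NUM_ROWS)]

clean_dataset = make_preference_dataset(toy_returns_a, toy_returns_b, seed=1)
noisy_dataset = make_preference_dataset(
    toy_returns_a, toy_returns_b, seed=1, flip_prob=0.2
)
```

Fixing the seed makes the labels repeatable, which matters when a summary such as this one reports no explicit train/validation/test split: the same generated rows can be reused across all compared algorithms.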