Human Alignment of Large Language Models through Online Preference Optimisation

Authors: Daniele Calandriello, Zhaohan Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare online IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task. [...] Finally, we provide an experimental suite contrasting these algorithms in several applications, which provides detailed comparisons between the proposed methods and several baselines, with notable take-aways for practitioners.
Researcher Affiliation | Collaboration | Google DeepMind. Correspondence to: Daniele Calandriello <dcalandriello@google.com>, Zhaohan Daniel Guo <danielguo@google.com>, Michal Valko <michal.valko@inria.fr>.
Pseudocode | Yes | B.1. Pseudo-Codes for Offline and Online Contrastive Preference Algorithm [...] Algorithm 1 Offline Contrastive Preference Algorithms (Offline-IPO/DPO/SLiC) [...] Algorithm 2 Online Contrastive Preference Algorithms (Online-IPO/DPO/SLiC) [...] Algorithm 3 Online IPO-MD (a hedged loss sketch follows the table)
Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We use the dataset described by Stiennon et al. (2020) that has been built from the TL;DR dataset (Völske et al., 2017). [...] We train our preference and reward model on the train set DTrain, which contains 92820 examples.
Dataset Splits | Yes | We use validation and test prompts from the XSum dataset (Narayan et al., 2018) for evaluation on the summarisation task, which is the same procedure used by Munos et al. (2023). [...] We evaluate every checkpoint of each algorithm against the RL checkpoint (over 2000 prompts sampled from a validation split) [...] We then perform 9 side-by-side evaluations (i.e., 3 × 3 1vs1 evaluations between each of the 3 seeds for each pair of methods) using 2000 prompts from a different validation split for each comparison. (The comparison schedule is sketched after the table.)
Hardware Specification | Yes | We use cloud Tensor Processing Units (TPUs; Jouppi et al., 2023) in their version 5e for our hardware compute, either in configurations of 2 × 4 devices for training offline experiments, or 4 × 4 devices for online experiments.
Software Dependencies | No | The paper mentions T5X, PaLM 2, and Adafactor, but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, or TensorFlow, which are crucial for reproducibility.
Experiment Setup | Yes | We run our experiments with default parameters 10^−4 for the learning rate, and a default total of 30,000 training steps, using a batch size of 32. The τ factor is held constant throughout training, and we do not employ any warmup steps. [...] For the other algorithms, we evaluate every checkpoint of each algorithm against the RL checkpoint [...] at different learning steps values (we checkpoint every 2000 learner steps for a total of 30k learner steps), regularisation parameter τ (we sweep over 5 values {0.1, 0.5, 1.0, 5.0, 10.0}) and also β for IPO-MD and Nash-MD-PG (we sweep over 2 values, 0.125 and 0.25), and we take the best checkpoint. (The sweep grid is sketched after the table.)
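
The Pseudocode row above points to Algorithms 1-3 but does not reproduce them. As a rough orientation only, the sketch below shows the per-pair losses that offline contrastive preference methods of the IPO/DPO/SLiC family typically optimise, written as functions of the log-likelihood-ratio margin between the preferred and dispreferred completion. The function names and the exact scaling of the DPO and SLiC branches are assumptions based on the original papers, not code from this paper.

```python
import math


def preference_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Log-likelihood-ratio margin between the preferred (w) and dispreferred (l) completions."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def contrastive_loss(margin, tau, variant="ipo"):
    """Per-pair loss as a function of the margin; tau is the regularisation strength."""
    if variant == "ipo":   # squared loss with target margin 1/(2*tau), as in the IPO objective
        return (margin - 1.0 / (2.0 * tau)) ** 2
    if variant == "dpo":   # logistic loss -log(sigmoid(tau * margin)), the usual DPO form
        return math.log1p(math.exp(-tau * margin))
    if variant == "slic":  # hinge loss on the scaled margin, as in SLiC-style calibration
        return max(0.0, 1.0 - tau * margin)
    raise ValueError(f"unknown variant: {variant}")
```

Roughly speaking, the online variants in Algorithms 2 and 3 apply the same family of losses to preference pairs sampled from the current policy (or from a mixture policy in the case of IPO-MD) rather than from a fixed offline dataset.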
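
The Dataset Splits row describes a 3 × 3 side-by-side protocol between each pair of methods. The sketch below simply enumerates that schedule to make the arithmetic explicit (3 seeds per method gives 9 pairings, each judged on 2000 validation prompts); the method names, seed indices, and dictionary fields are illustrative and do not come from the paper.

```python
from itertools import product

SEEDS = (0, 1, 2)      # three training seeds per method (indices assumed)
EVAL_PROMPTS = 2000    # prompts drawn from a validation split for each comparison


def side_by_side_schedule(method_a, method_b):
    """Enumerate the 9 (3 x 3) one-vs-one comparisons between two methods."""
    for seed_a, seed_b in product(SEEDS, SEEDS):
        yield {"left": (method_a, seed_a),
               "right": (method_b, seed_b),
               "num_prompts": EVAL_PROMPTS}


# Example with two hypothetical method names: 3 x 3 = 9 comparisons.
assert len(list(side_by_side_schedule("online-ipo", "online-dpo"))) == 9
```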
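
The Experiment Setup row lists the default hyper-parameters and the sweep grid. The sketch below packages those quoted numbers into a small configuration generator; the dictionary keys, algorithm name strings, and the sweep helper itself are assumed for illustration and do not appear in the paper.

```python
from itertools import product

# Defaults quoted in the Experiment Setup row.
DEFAULTS = {
    "learning_rate": 1e-4,
    "total_steps": 30_000,
    "batch_size": 32,
    "checkpoint_every": 2_000,
    "warmup_steps": 0,
}

TAU_SWEEP = (0.1, 0.5, 1.0, 5.0, 10.0)   # regularisation parameter tau
BETA_SWEEP = (0.125, 0.25)               # only swept for IPO-MD and Nash-MD-PG


def sweep(algorithm):
    """Yield one run configuration per (tau, beta) combination for a given algorithm."""
    betas = BETA_SWEEP if algorithm in ("ipo-md", "nash-md-pg") else (None,)
    for tau, beta in product(TAU_SWEEP, betas):
        yield {**DEFAULTS, "algorithm": algorithm, "tau": tau, "beta": beta}


# Example: 5 tau values x 2 beta values = 10 configurations for IPO-MD.
assert len(list(sweep("ipo-md"))) == 10
```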