Generalized Preference Optimization: A Unified Approach to Offline Alignment

Authors: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Remi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila Pires, Bilal Piot

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides an arXiv link to the full version of the paper, but no concrete access to source code for the methodology described. |
| Open Datasets | Yes | We consider the summarization task similar to (Roit et al., 2023), where the offline dataset is an open-source summarization dataset collected with human feedback labels (Stiennon et al., 2020). |
| Dataset Splits | No | The paper does not provide the dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning. It mentions "evaluate checkpoints every 2k steps", but this refers to evaluation frequency during training, not to data splits. |
| Hardware Specification | No | The paper mentions training models of specific sizes (e.g., "XXL model (11 billion parameters)", "Large T5X model (110 million parameters)", "700M parameters"), but it does not specify the hardware (e.g., exact GPU/CPU models, processor types and speeds, or detailed machine specifications) used to run its experiments. |
| Software Dependencies | No | The paper mentions software components such as T5X, the Adafactor optimizer, and the PaLM-2 model, but it does not provide version numbers for these ancillary software dependencies. |
| Experiment Setup | Yes | For each β, we train the model for 2 × 10^4 steps with a constant learning rate (10^-5 and 3 × 10^-5). We evaluate checkpoints every 2k steps for a total of 20k training steps. |
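The Experiment Setup row pins down a concrete training schedule: 2 × 10^4 steps per β, constant learning rates of 10^-5 and 3 × 10^-5, and evaluation every 2k steps. The minimal sketch below restates that schedule as a sweep loop; the β values and the `train_step`/`evaluate` callables are assumptions for illustration only, not part of the paper or any released code.

```python
# Minimal sketch of the training schedule described in the "Experiment Setup"
# row above. The step counts and learning rates mirror the quoted values; the
# beta values and the function names are hypothetical placeholders.

TOTAL_STEPS = 20_000            # 2 x 10^4 training steps per run
EVAL_EVERY = 2_000              # checkpoints evaluated every 2k steps
LEARNING_RATES = (1e-5, 3e-5)   # constant learning rates quoted in the paper
BETAS = (0.1, 1.0)              # regularization strengths beta (placeholder values)


def run_sweep(train_step, evaluate):
    """Run one training job per (beta, learning rate) pair.

    `train_step` and `evaluate` stand in for the actual training and
    evaluation routines, which the paper does not release.
    """
    for beta in BETAS:
        for lr in LEARNING_RATES:
            for step in range(1, TOTAL_STEPS + 1):
                train_step(beta=beta, learning_rate=lr, step=step)
                if step % EVAL_EVERY == 0:
                    evaluate(beta=beta, learning_rate=lr, step=step)
```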