Generalized Preference Optimization: A Unified Approach to Offline Alignment
Authors: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila Pires, Bilal Piot
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; an illustrative sketch of its loss family is given after this table. |
| Open Source Code | No | The paper provides an ArXiv link to the full version of the paper, but no concrete access to source code for the methodology described. |
| Open Datasets | Yes | We consider the summarization task similar to (Roit et al., 2023), where the offline dataset is an open source summarization dataset collected with human feedback labels (Stiennon et al., 2020). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions "evaluate checkpoints every 2k steps" but this refers to evaluation frequency during training, not data splits. |
| Hardware Specification | No | The paper mentions training models of specific sizes (e.g., "XXL model (11 billion parameters)", "Large T5X model (110 million parameters)", "700M parameters"), but it does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like T5X, Adafactor optimizer, and PaLM-2 model, but it does not provide specific version numbers for these ancillary software dependencies. |
| Experiment Setup | Yes | For each β, we train the model for 2 × 10⁴ steps with a constant learning rate (10⁻⁵ and 3 × 10⁻⁵). We evaluate checkpoints every 2k steps for a total of 20k training steps. |
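
Since the Pseudocode row reports that the paper contains no algorithm block, the following is a minimal sketch of the loss family the paper unifies: a convex function f applied to the β-scaled log-ratio margin between the preferred and dispreferred completions. It is written in JAX to match the paper's T5X stack; the function name, signature, and batching convention are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch, assuming the GPO objective as described in the paper:
# a convex function f applied to the beta-scaled log-ratio margin. All
# names below are hypothetical, not the authors' code.
import jax
import jax.numpy as jnp

def gpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, f):
    """GPO loss over a batch of preference pairs.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred completions; ref_logp_w / ref_logp_l: the same
    quantities under the frozen reference policy; beta: regularization
    strength; f: the convex function selecting the family member.
    """
    # Log-ratio margin: log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)].
    rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return jnp.mean(f(beta * rho))

# Convex functions recovering known offline-alignment losses:
dpo_f  = lambda t: jax.nn.softplus(-t)        # logistic loss -> DPO
ipo_f  = lambda t: (t - 1.0) ** 2             # squared loss  -> IPO
slic_f = lambda t: jnp.maximum(0.0, 1.0 - t)  # hinge loss    -> SLiC
```

Under this parameterization, existing offline methods differ only in the choice of f, while the β swept in the Experiment Setup row controls the strength of the offline regularization that the paper contrasts with the KL regularization of the canonical RLHF formulation.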