Geometric-Averaged Preference Optimization for Soft Preference Labels
Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang (Shane) Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments simulate the soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. |
| Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Kuang-Huei Lee (1), Shixiang Shane Gu (1), Yutaka Matsuo (2), Aleksandra Faust (1), Heiga Zen (1), Izzeddin Gur (1); 1: Google DeepMind, 2: The University of Tokyo |
| Pseudocode | No | The paper describes algorithms using mathematical equations and text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | No. Our experiments are based on the open-source datasets [49, 4, 6]. We also constructed a novel dataset using the responses from LLMs, but we have not conducted an extensive toxicity check for those outputs. Currently, we are working on it for future release. |
| Open Datasets | Yes | We use the popular RLHF datasets, such as Reddit TL;DR [49, 55] (summarization), and Anthropic Helpful and Harmless [3] (conversation) for the benchmark. We construct competitive paired samples with winner responses and PaLM 2-L to simulate diverse preference distributions that have a peak around the modest confidence (e.g. p̂ ∈ [0.7, 0.9)). We also prepare two other datasets based on Plasma Plan [6]. (See the hypothetical soft-label sketch after this table.) |
| Dataset Splits | No | The paper mentions 'validation prompts' in Appendix B ("To select the final checkpoint after RL-finetuning, we picked the last 4 checkpoints just before the length of outputs to the validation prompts started exceeding the max output tokens"), but does not specify the size or percentage of a validation split for the datasets used. |
| Hardware Specification | Yes | We used cloud TPU-v3, which has a 32 GiB HBM memory space, with a proper number of cores. |
| Software Dependencies | No | The paper mentions using specific LLMs (PaLM 2-XS, PaLM 2-L instruction-tuned on the Flan dataset) but does not provide specific software dependencies like libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | We set β = 0.1 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR) for DPO, cDPO, ROPO following Rafailov et al. [41]. As discussed in Section 4, geometric averaging may require larger β to maintain the scale of gradient from the reliable training samples; GDPO and GROPO used β = 0.3 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR). For IPO and cIPO, we used β = 1.0 (Reddit TL;DR, Anthropic Helpful, Harmless) and β = 0.1 (Plasma Plan) as recommended in Guo et al. [20]. For GIPO, we set β to 0.5 (Reddit TL;DR, Anthropic Helpful, Harmless) and 0.05 (Plasma Plan). For ROPO and GROPO, we employed α = 2.0 and γ = 0.1 as described in Liang et al. [27]. (A sketch of the geometric-averaged loss follows this table.) |
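The "Open Datasets" row notes that soft preference labels p̂ are simulated with AI feedback from PaLM 2-L, with the resulting distribution peaking at modest confidence (p̂ ∈ [0.7, 0.9)). The paper's exact scoring pipeline is not reproduced here, so the snippet below is only a hypothetical illustration of turning a judge model's pairwise scores into such a soft label; the function name, the scalar-score interface, and the temperature parameter are assumptions, not the authors' implementation.

```python
import math

def soft_label_from_judge_scores(score_a: float, score_b: float,
                                 temperature: float = 1.0) -> float:
    """Hypothetical AI-feedback soft label: map a judge model's scalar
    scores for two candidate responses to p_hat = P(A preferred over B).

    The scalar-score interface and the temperature are illustrative
    assumptions, not the paper's pipeline.
    """
    # Two-way softmax over the scores; a higher temperature flattens the
    # preference distribution toward 0.5, a lower one pushes it to 0/1.
    z = (score_a - score_b) / temperature
    return 1.0 / (1.0 + math.exp(-z))


# Example: a judge that slightly prefers response A yields a modest label,
# roughly in the [0.7, 0.9) band described in the table above.
print(round(soft_label_from_judge_scores(1.2, 0.2), 3))  # ~0.731
```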
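The "Experiment Setup" row quotes the paper's observation that geometric averaging may require a larger β to maintain the gradient scale from reliable samples. One way to see why: a p̂-weighted geometric mean of the chosen/rejected likelihoods reduces, after the log, to scaling the DPO reward margin by (2p̂ − 1), so labels near 0.5 shrink the effective margin and its gradient. The sketch below is a minimal, per-example reading of that objective; the function names, argument layout, and the absence of batching are simplifications, not the authors' code.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def gdpo_loss(logp_w: float, logp_l: float,
              ref_logp_w: float, ref_logp_l: float,
              p_hat: float, beta: float = 0.3) -> float:
    """Sketch of a geometric-averaged DPO loss for a single example.

    logp_w / logp_l: policy log-likelihoods of the preferred / dispreferred
    responses; ref_logp_*: the same under the frozen reference model;
    p_hat: soft preference label in [0.5, 1.0].
    """
    # Implicit rewards as in DPO: r = log pi_theta(y|x) - log pi_ref(y|x).
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    # Geometric averaging of the paired likelihoods, weighted by p_hat,
    # collapses to scaling the reward margin by (2 * p_hat - 1): an
    # unconfident label (p_hat near 0.5) nearly zeroes the margin, while a
    # confident label keeps the full DPO margin.
    margin = (2.0 * p_hat - 1.0) * (r_w - r_l)
    # Same binary cross-entropy form as DPO: -log sigmoid(beta * margin).
    return -log_sigmoid(beta * margin)
```

Because unconfident labels shrink the margin, a dataset dominated by p̂ ∈ [0.7, 0.9) lowers the overall gradient scale, which is consistent with the larger β reported above for GDPO and GROPO (0.3 instead of 0.1 on Anthropic Helpful, Harmless, and Plasma Plan).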