Geometric-Averaged Preference Optimization for Soft Preference Labels

Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang (Shane) Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments simulate the soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. (A sketch of the geometric-averaged loss follows the table.)
Researcher Affiliation | Collaboration | Hiroki Furuta1,2, Kuang-Huei Lee1, Shixiang Shane Gu1, Yutaka Matsuo2, Aleksandra Faust1, Heiga Zen1, Izzeddin Gur1 (1Google DeepMind, 2The University of Tokyo)
Pseudocode | No | The paper describes algorithms using mathematical equations and text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | No. Our experiments are based on the open-source datasets [49, 4, 6]. We also constructed a novel dataset using the responses from LLMs, but we have not conducted an extensive toxicity check for those outputs. Currently, we are working on it for future release.
Open Datasets | Yes | We use the popular RLHF datasets, such as Reddit TL;DR [49, 55] (summarization) and Anthropic Helpful and Harmless [3] (conversation), for the benchmark. We construct competitive paired samples with winner responses and PaLM 2-L to simulate diverse preference distributions that have a peak around a modest confidence (e.g. p̂ ∈ [0.7, 0.9)). We also prepare two other datasets based on Plasma Plan [6]. (A hypothetical soft-label sketch follows the table.)
Dataset Splits | No | The paper mentions 'validation prompts' in Appendix B ("To select the final checkpoint after RL-finetuning, we picked the last 4 checkpoints just before the length of outputs to the validation prompts started exceeding the max output tokens"), but it does not specify the size or percentage of a validation split for the datasets used.
Hardware Specification | Yes | We used cloud TPU-v3, which has a 32 GiB HBM memory space, with a proper number of cores.
Software Dependencies | No | The paper mentions using specific LLMs (PaLM 2-XS, and PaLM 2-L instruction-tuned on the Flan dataset) but does not provide specific software dependencies like libraries or frameworks with their version numbers.
Experiment Setup | Yes | We set β = 0.1 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR) for DPO, cDPO, and ROPO, following Rafailov et al. [41]. As discussed in Section 4, geometric averaging may require a larger β to maintain the scale of the gradient from the reliable training samples; GDPO and GROPO used β = 0.3 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR). For IPO and cIPO, we used β = 1.0 (Reddit TL;DR, Anthropic Helpful, Harmless) and β = 0.1 (Plasma Plan), as recommended in Guo et al. [20]. For GIPO, we set β to 0.5 (Reddit TL;DR, Anthropic Helpful, Harmless) and 0.05 (Plasma Plan). For ROPO and GROPO, we employed α = 2.0 and γ = 0.1, as described in Liang et al. [27]. (These values are gathered into a lookup below.)
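To make the "Research Type" row concrete: the geometric averaging the paper describes replaces each response likelihood in the DPO objective with the p̂-weighted geometric mean of the two responses' likelihoods, which reduces to scaling the usual DPO margin by (2p̂ − 1). Below is a minimal PyTorch sketch of that loss, assuming summed per-sequence log-probabilities are already available; the function and argument names are placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def gdpo_loss(pi_logp_1, pi_logp_2, ref_logp_1, ref_logp_2, p_hat, beta=0.3):
    """Geometric-averaged DPO (GDPO) loss for soft preference labels.

    Replacing each response likelihood in DPO with the p_hat-weighted
    geometric mean of the two responses' likelihoods is equivalent to
    scaling the per-example DPO margin by (2 * p_hat - 1).
    """
    # Standard DPO margin: difference of policy/reference log-ratios.
    margin = (pi_logp_1 - ref_logp_1) - (pi_logp_2 - ref_logp_2)
    # A soft label p_hat near 0.5 shrinks the margin toward zero, so
    # ambiguous pairs contribute weaker gradients than confident ones.
    return -F.logsigmoid(beta * (2.0 * p_hat - 1.0) * margin).mean()
```

With p̂ = 1 this recovers the standard DPO loss, which is consistent with the remark in the "Experiment Setup" row that geometric averaging may need a larger β to keep the gradient scale of reliable samples.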
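The "Open Datasets" row mentions simulating soft labels p̂ with AI feedback from PaLM 2-L, but the exact querying protocol is not quoted here. The sketch below shows one plausible scheme, stated purely as an assumption for illustration: sample an LLM judge several times with the presentation order flipped and take the empirical win rate as p̂. `judge` is a hypothetical callable, not an interface from the paper.

```python
import random

def soft_label(judge, prompt, resp_a, resp_b, n_samples=8):
    """Estimate a soft preference p_hat = P(resp_a preferred over resp_b).

    Hypothetical scheme: query an LLM judge n_samples times, flipping
    the presentation order to reduce position bias, and use the
    empirical win rate as p_hat. `judge` returns "A" or "B" for
    whichever of its two candidate arguments it prefers.
    """
    wins = 0
    for _ in range(n_samples):
        if random.random() < 0.5:
            wins += judge(prompt, resp_a, resp_b) == "A"  # resp_a shown first
        else:
            wins += judge(prompt, resp_b, resp_a) == "B"  # order flipped
    return wins / n_samples
```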
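For quick reference, the β, α, and γ values quoted in the "Experiment Setup" row can be collected into a single lookup. This is only a restatement of the reported settings, not configuration code from the paper.

```python
# beta per (methods, dataset family), as reported in the paper's setup.
BETA = {
    # DPO, cDPO, ROPO follow Rafailov et al. [41].
    ("DPO/cDPO/ROPO", "Anthropic Helpful, Harmless, Plasma Plan"): 0.1,
    ("DPO/cDPO/ROPO", "Reddit TL;DR"): 0.5,
    # Geometric-averaged variants use a larger beta to keep gradient scale.
    ("GDPO/GROPO", "Anthropic Helpful, Harmless, Plasma Plan"): 0.3,
    ("GDPO/GROPO", "Reddit TL;DR"): 0.5,
    # IPO and cIPO follow Guo et al. [20].
    ("IPO/cIPO", "Reddit TL;DR, Anthropic Helpful, Harmless"): 1.0,
    ("IPO/cIPO", "Plasma Plan"): 0.1,
    ("GIPO", "Reddit TL;DR, Anthropic Helpful, Harmless"): 0.5,
    ("GIPO", "Plasma Plan"): 0.05,
}

# ROPO and GROPO additionally use these values, per Liang et al. [27].
ROPO_ALPHA, ROPO_GAMMA = 2.0, 0.1
```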