Geometric-Averaged Preference Optimization for Soft Preference Labels
Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang (Shane) Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments simulate the soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. |
| Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Kuang-Huei Lee (1), Shixiang Shane Gu (1), Yutaka Matsuo (2), Aleksandra Faust (1), Heiga Zen (1), Izzeddin Gur (1); 1: Google DeepMind, 2: The University of Tokyo |
| Pseudocode | No | The paper describes algorithms using mathematical equations and text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | No. Our experiments are based on the open-source datasets [49, 4, 6]. We also constructed a novel dataset using the responses from LLMs, but we have not conducted an extensive toxicity check for those outputs. Currently, we are working on it for future release. |
| Open Datasets | Yes | We use the popular RLHF datasets, such as Reddit TL;DR [49, 55] (summarization), and Anthropic Helpful and Harmless [3] (conversation) for the benchmark. We construct competitive paired samples with winner responses and PaLM 2-L to simulate diverse preference distributions that have a peak around the modest confidence (e.g. p̂ ∈ [0.7, 0.9)). We also prepare two other datasets based on Plasma Plan [6]. (See the hypothetical soft-label sketch after this table.) |
| Dataset Splits | No | The paper mentions 'validation prompts' in Appendix B ("To select the final checkpoint after RL-finetuning, we picked the last 4 checkpoints just before the length of outputs to the validation prompts started exceeding the max output tokens"), but does not specify the size or percentage of a validation split for the datasets used. |
| Hardware Specification | Yes | We used cloud TPU-v3, which has a 32 GiB HBM memory space, with a proper number of cores. |
| Software Dependencies | No | The paper mentions using specific LLMs (PaLM 2-XS, PaLM 2-L instruction-tuned on the Flan dataset) but does not provide specific software dependencies like libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | We set β = 0.1 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR) for DPO, cDPO, ROPO following Rafailov et al. [41]. As discussed in Section 4, geometric averaging may require larger β to maintain the scale of gradient from the reliable training samples; GDPO and GROPO used β = 0.3 (Anthropic Helpful, Harmless, Plasma Plan) and β = 0.5 (Reddit TL;DR). For IPO and cIPO, we used β = 1.0 (Reddit TL;DR, Anthropic Helpful, Harmless) and β = 0.1 (Plasma Plan) as recommended in Guo et al. [20]. For GIPO, we set β to 0.5 (Reddit TL;DR, Anthropic Helpful, Harmless) and 0.05 (Plasma Plan). For ROPO and GROPO, we employed α = 2.0 and γ = 0.1 as described in Liang et al. [27]. (A sketch of the geometric-averaged loss follows this table.) |
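The "Open Datasets" row notes that soft preference labels p̂ are simulated with AI feedback from PaLM 2-L, with the resulting distribution peaking at modest confidence (p̂ ∈ [0.7, 0.9)). The paper's exact scoring pipeline is not reproduced here, so the snippet below is only a hypothetical illustration of turning a judge model's pairwise scores into such a soft label; the function name, the scalar-score interface, and the temperature parameter are assumptions, not the authors' implementation.

```python
import math

def soft_label_from_judge_scores(score_a: float, score_b: float,
                                 temperature: float = 1.0) -> float:
    """Hypothetical AI-feedback soft label: map a judge model's scalar
    scores for two candidate responses to p_hat = P(A preferred over B).

    The scalar-score interface and the temperature are illustrative
    assumptions, not the paper's pipeline.
    """
    # Two-way softmax over the scores; a higher temperature flattens the
    # preference distribution toward 0.5, a lower one pushes it to 0/1.
    z = (score_a - score_b) / temperature
    return 1.0 / (1.0 + math.exp(-z))


# Example: a judge that slightly prefers response A yields a modest label,
# roughly in the [0.7, 0.9) band described in the table above.
print(round(soft_label_from_judge_scores(1.2, 0.2), 3))  # ~0.731
```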
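The "Experiment Setup" row quotes the paper's observation that geometric averaging may require a larger β to maintain the gradient scale from reliable samples. One way to see why: a p̂-weighted geometric mean of the chosen/rejected likelihoods reduces, after the log, to scaling the DPO reward margin by (2p̂ − 1), so labels near 0.5 shrink the effective margin and its gradient. The sketch below is a minimal, per-example reading of that objective; the function names, argument layout, and the absence of batching are simplifications, not the authors' code.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def gdpo_loss(logp_w: float, logp_l: float,
              ref_logp_w: float, ref_logp_l: float,
              p_hat: float, beta: float = 0.3) -> float:
    """Sketch of a geometric-averaged DPO loss for a single example.

    logp_w / logp_l: policy log-likelihoods of the preferred / dispreferred
    responses; ref_logp_*: the same under the frozen reference model;
    p_hat: soft preference label in [0.5, 1.0].
    """
    # Implicit rewards as in DPO: r = log pi_theta(y|x) - log pi_ref(y|x).
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    # Geometric averaging of the paired likelihoods, weighted by p_hat,
    # collapses to scaling the reward margin by (2 * p_hat - 1): an
    # unconfident label (p_hat near 0.5) nearly zeroes the margin, while a
    # confident label keeps the full DPO margin.
    margin = (2.0 * p_hat - 1.0) * (r_w - r_l)
    # Same binary cross-entropy form as DPO: -log sigmoid(beta * margin).
    return -log_sigmoid(beta * margin)
```

Because unconfident labels shrink the margin, a dataset dominated by p̂ ∈ [0.7, 0.9) lowers the overall gradient scale, which is consistent with the larger β reported above for GDPO and GROPO (0.3 instead of 0.1 on Anthropic Helpful, Harmless, and Plasma Plan).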