Group Robust Preference Optimization in Reward-free RLHF

Authors: Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
Researcher Affiliation | Collaboration | Shyam Sundhar Ramesh (University College London, UCL); Yifan Hu (ETH Zurich, EPFL); Iason Chaimalas (University College London, UCL); Viraj Mehta (Tensor Zero); Pier Giuseppe Sessa (ETH Zurich); Haitham Bou Ammar (University College London, UCL, and Huawei Noah's Ark Lab); Ilija Bogunovic (University College London, UCL)
Pseudocode | Yes | Algorithm 1: Mirror Descent for Group Robust Preference Optimization (GRPO); a hedged sketch of this update appears after the table.
Open Source Code | Yes | Code for the synthetic and real-world experiments is available at https://github.com/rsshyam/GRPO-bandits and https://github.com/rsshyam/GRPO, respectively.
Open Datasets | Yes | For the real-data experiments, we consider the survey dataset GlobalOpinionQA [15] and the publicly available Gemma-2B model [48]; a loading sketch follows the table.
Dataset Splits | Yes | The data is split into 80% for training, 10% for validation, and 10% for testing (an illustrative split sketch follows the table).
Hardware Specification | Yes | All experiments were run on a single node with an A100 SXM4 GPU (40 GB GPU memory), 30 CPU cores, 200 GB RAM, and 525 GB of SSD storage.
Software Dependencies | No | The paper mentions using the AdamW [27] optimizer but does not provide version numbers for key software components such as the programming language (e.g., Python) or the deep learning framework (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We use the following hyperparameters for the synthetic experiments. The importance sampling methods use the same hyperparameters as the corresponding vanilla ones. Further, we note that there are no learning rates for IPO and GR-IPO, as we use the closed-form solution detailed in Section 4.1 for updates. Table 1: Hyperparameters for synthetic experiments. Table 2: Hyperparameters for SFT, IPO, and GR-IPO training. A hedged sketch of such a closed-form update appears below.
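The Pseudocode row above refers to Algorithm 1, a mirror-descent update over group weights alternated with gradient steps on the group-weighted preference loss. Below is a minimal sketch of one such alternating step, assuming a generic per-example preference loss, integer group ids, and an exponentiated-gradient step size eta_alpha; the function name and all hyperparameters are illustrative placeholders, not the paper's exact implementation.

```python
# Minimal sketch of an Algorithm-1-style alternating update (illustrative only):
# an exponentiated-gradient (mirror-descent) step on group weights, followed by
# a gradient step on the group-weighted preference loss.
import torch

def grpo_step(per_example_loss, group_ids, group_weights, optimizer,
              num_groups, eta_alpha=0.1):
    """per_example_loss: (n,) differentiable losses (e.g., DPO/IPO terms);
    group_ids: (n,) long tensor of group indices;
    group_weights: (num_groups,) current weights on the simplex."""
    with torch.no_grad():
        # Average loss per group on the current batch (no gradient needed here).
        group_losses = torch.zeros(num_groups)
        for g in range(num_groups):
            mask = group_ids == g
            if mask.any():
                group_losses[g] = per_example_loss[mask].mean()
        # Mirror-descent step on the simplex: up-weight worse-performing groups.
        group_weights = group_weights * torch.exp(eta_alpha * group_losses)
        group_weights = group_weights / group_weights.sum()

    # Gradient step on the group-weighted loss for the policy parameters.
    weighted_loss = (group_weights[group_ids] * per_example_loss).mean()
    optimizer.zero_grad()
    weighted_loss.backward()
    optimizer.step()
    return group_weights
```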
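For the Open Datasets row, both the survey data and the base model are publicly available. The sketch below loads them from the Hugging Face Hub, assuming the hub identifiers "Anthropic/llm_global_opinions" (GlobalOpinionQA) and "google/gemma-2b"; these identifiers are assumptions, and the released repositories may load the data differently.

```python
# Sketch of loading GlobalOpinionQA and Gemma-2B from the Hugging Face Hub.
# The hub identifiers below are assumptions, not confirmed by the report text.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

opinions = load_dataset("Anthropic/llm_global_opinions")  # survey questions with per-group answer distributions
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
```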
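The 80/10/10 split in the Dataset Splits row can be reproduced with a simple shuffled index split; the seed and exact splitting logic below are placeholders and may differ from the released code.

```python
# Illustrative 80/10/10 train/validation/test split with a fixed seed (placeholder values).
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = int(train_frac * n), int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)  # e.g., 800 / 100 / 100 examples
```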
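The Experiment Setup row notes that IPO and GR-IPO require no learning rate in the synthetic experiments because a closed-form solution (Section 4.1) is used. One plausible reading, assuming a log-linear policy class, is that the IPO objective reduces to a (group-)weighted least-squares problem over preference feature differences; the sketch below solves such a problem in closed form, but the exact target, weighting, and regularization in the paper may differ.

```python
# Hedged sketch of a closed-form (weighted least-squares) IPO-style update for a
# log-linear policy; one plausible reading of the "closed-form solution" noted
# above, not necessarily the paper's exact formulation.
import numpy as np

def ipo_closed_form(feat_diff, target, sample_weights=None, ridge=1e-6):
    """Solve min_theta sum_i w_i * (feat_diff_i . theta - target_i)^2 in closed form.

    feat_diff: (n, d) array of phi(x, y_chosen) - phi(x, y_rejected)
    target:    (n,) regression targets (e.g., derived from the IPO margin 1/(2*tau))
    sample_weights: optional per-example weights (e.g., group weights for GR-IPO)
    """
    n, d = feat_diff.shape
    w = np.ones(n) if sample_weights is None else np.asarray(sample_weights)
    A = feat_diff.T @ (w[:, None] * feat_diff) + ridge * np.eye(d)  # weighted normal equations
    b = feat_diff.T @ (w * target)
    return np.linalg.solve(A, b)
```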