Group Robust Preference Optimization in Reward-free RLHF

Authors: Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
Researcher Affiliation | Collaboration | Shyam Sundhar Ramesh (University College London, UCL); Yifan Hu (ETH Zurich, EPFL); Iason Chaimalas (University College London, UCL); Viraj Mehta (Tensor Zero); Pier Giuseppe Sessa (ETH Zurich); Haitham Bou Ammar (University College London, UCL, and Huawei Noah's Ark Lab); Ilija Bogunovic (University College London, UCL)
Pseudocode | Yes | Algorithm 1: Mirror Descent for Group Robust Preference Optimization (GRPO); a hedged sketch of this update appears after the table.
Open Source Code | Yes | Code for the synthetic and real-world experiments is available at https://github.com/rsshyam/GRPO-bandits and https://github.com/rsshyam/GRPO, respectively.
Open Datasets | Yes | For the real-data experiments, we consider the survey dataset GlobalOpinionQA [15] and the publicly available Gemma-2B model [48]; a loading sketch follows the table.
Dataset Splits | Yes | The data is split into 80% for training, 10% for validation, and 10% for testing (an illustrative split sketch follows the table).
Hardware Specification | Yes | All experiments were run on a single node with an A100 SXM4 GPU (40 GB GPU memory), 30 CPU cores, 200 GB RAM, and 525 GB of SSD storage.
Software Dependencies | No | The paper mentions using the AdamW [27] optimizer but does not provide version numbers for key software components such as the programming language (e.g., Python) or the deep learning framework (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We use the following hyperparameters for the synthetic experiments. The importance sampling methods use the same hyperparameters as the corresponding vanilla ones. Further, we note that there are no learning rates for IPO and GR-IPO, as we use the closed-form solution detailed in Section 4.1 for updates. Table 1: Hyperparameters for synthetic experiments. Table 2: Hyperparameters for SFT, IPO, and GR-IPO training. A hedged sketch of such a closed-form update appears below.
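The Pseudocode row above refers to Algorithm 1, a mirror-descent update over group weights alternated with gradient steps on the group-weighted preference loss. Below is a minimal sketch of one such alternating step, assuming a generic per-example preference loss, integer group ids, and an exponentiated-gradient step size eta_alpha; the function name and all hyperparameters are illustrative placeholders, not the paper's exact implementation.

```python
# Minimal sketch of an Algorithm-1-style alternating update (illustrative only):
# an exponentiated-gradient (mirror-descent) step on group weights, followed by
# a gradient step on the group-weighted preference loss.
import torch

def grpo_step(per_example_loss, group_ids, group_weights, optimizer,
              num_groups, eta_alpha=0.1):
    """per_example_loss: (n,) differentiable losses (e.g., DPO/IPO terms);
    group_ids: (n,) long tensor of group indices;
    group_weights: (num_groups,) current weights on the simplex."""
    with torch.no_grad():
        # Average loss per group on the current batch (no gradient needed here).
        group_losses = torch.zeros(num_groups)
        for g in range(num_groups):
            mask = group_ids == g
            if mask.any():
                group_losses[g] = per_example_loss[mask].mean()
        # Mirror-descent step on the simplex: up-weight worse-performing groups.
        group_weights = group_weights * torch.exp(eta_alpha * group_losses)
        group_weights = group_weights / group_weights.sum()

    # Gradient step on the group-weighted loss for the policy parameters.
    weighted_loss = (group_weights[group_ids] * per_example_loss).mean()
    optimizer.zero_grad()
    weighted_loss.backward()
    optimizer.step()
    return group_weights
```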
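For the Open Datasets row, both the survey data and the base model are publicly available. The sketch below loads them from the Hugging Face Hub, assuming the hub identifiers "Anthropic/llm_global_opinions" (GlobalOpinionQA) and "google/gemma-2b"; these identifiers are assumptions, and the released repositories may load the data differently.

```python
# Sketch of loading GlobalOpinionQA and Gemma-2B from the Hugging Face Hub.
# The hub identifiers below are assumptions, not confirmed by the report text.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

opinions = load_dataset("Anthropic/llm_global_opinions")  # survey questions with per-group answer distributions
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
```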
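The 80/10/10 split in the Dataset Splits row can be reproduced with a simple shuffled index split; the seed and exact splitting logic below are placeholders and may differ from the released code.

```python
# Illustrative 80/10/10 train/validation/test split with a fixed seed (placeholder values).
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = int(train_frac * n), int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)  # e.g., 800 / 100 / 100 examples
```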
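The Experiment Setup row notes that IPO and GR-IPO require no learning rate in the synthetic experiments because a closed-form solution (Section 4.1) is used. One plausible reading, assuming a log-linear policy class, is that the IPO objective reduces to a (group-)weighted least-squares problem over preference feature differences; the sketch below solves such a problem in closed form, but the exact target, weighting, and regularization in the paper may differ.

```python
# Hedged sketch of a closed-form (weighted least-squares) IPO-style update for a
# log-linear policy; one plausible reading of the "closed-form solution" noted
# above, not necessarily the paper's exact formulation.
import numpy as np

def ipo_closed_form(feat_diff, target, sample_weights=None, ridge=1e-6):
    """Solve min_theta sum_i w_i * (feat_diff_i . theta - target_i)^2 in closed form.

    feat_diff: (n, d) array of phi(x, y_chosen) - phi(x, y_rejected)
    target:    (n,) regression targets (e.g., derived from the IPO margin 1/(2*tau))
    sample_weights: optional per-example weights (e.g., group weights for GR-IPO)
    """
    n, d = feat_diff.shape
    w = np.ones(n) if sample_weights is None else np.asarray(sample_weights)
    A = feat_diff.T @ (w[:, None] * feat_diff) + ridge * np.eye(d)  # weighted normal equations
    b = feat_diff.T @ (w * target)
    return np.linalg.solve(A, b)
```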