Group Robust Preference Optimization in Reward-free RLHF
Authors: Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines. |
| Researcher Affiliation | Collaboration | Shyam Sundhar Ramesh (University College London, UCL); Yifan Hu (ETH Zurich, EPFL); Iason Chaimalas (UCL); Viraj Mehta (TensorZero); Pier Giuseppe Sessa (ETH Zurich); Haitham Bou Ammar (UCL, Huawei Noah's Ark Lab); Ilija Bogunovic (UCL) |
| Pseudocode | Yes | Algorithm 1: Mirror Descent for Group Robust Preference Optimization (GRPO) (see the sketch after this table) |
| Open Source Code | Yes | Code for the synthetic and real-world experiments can be found at https://github.com/rsshyam/GRPO-bandits and https://github.com/rsshyam/GRPO, respectively. |
| Open Datasets | Yes | For the real-data experiments, we consider the survey dataset GlobalOpinionQA [15] and the publicly available Gemma-2B model [48]. |
| Dataset Splits | Yes | The data is split as 80% for training, 10% for validation, and 10% for testing (a split sketch follows the table). |
| Hardware Specification | Yes | All experiments were run on a single A100 SXM4 node with 40GB GPU memory, 30 CPU cores, 200GB RAM, and 525GB SSD storage. |
| Software Dependencies | No | The paper mentions using the 'AdamW [27] optimizer' but does not provide version numbers for key software components such as the programming language (e.g., Python) or the deep learning framework (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We use the following hyperparameters for the synthetic experiments. The importance sampling methods use the same hyperparameters as the corresponding vanilla ones. Further, we note that there are no learning rates for IPO and GR-IPO, as we use the closed-form solution detailed in Section 4.1 for updates. Table 1: Hyperparameters for synthetic experiments. Table 2: Hyperparameters for SFT, IPO, and GR-IPO training. |
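
To make the Pseudocode row's pointer to Algorithm 1 concrete, below is a minimal sketch of the alternating update that mirror-descent group-robust methods of this kind perform: an exponentiated-gradient (multiplicative-weights) step on the group weights over the simplex, which upweights the worst-performing groups, followed by a weighted update of the policy parameters. The function name `grpo_step`, the NumPy representation, and both step sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def grpo_step(group_losses, alpha, theta, group_grads, eta_alpha, eta_theta):
    """One alternating update in the spirit of Algorithm 1 (sketch).

    group_losses: per-group preference losses L_g(theta), shape [G]
    alpha:        current group weights on the simplex,   shape [G]
    theta:        policy parameters,                      shape [D]
    group_grads:  per-group gradients dL_g/dtheta,        shape [G, D]
    """
    # Mirror ascent on the simplex: exponentiated-gradient step that
    # increases the weight of groups with higher loss, then renormalize.
    alpha = alpha * np.exp(eta_alpha * group_losses)
    alpha = alpha / alpha.sum()

    # Gradient step on the policy using the alpha-weighted group gradients.
    theta = theta - eta_theta * (alpha[:, None] * group_grads).sum(axis=0)
    return alpha, theta

# Toy usage: 3 groups, 2-dimensional parameter vector.
alpha = np.ones(3) / 3
theta = np.zeros(2)
losses = np.array([0.9, 0.4, 0.2])
grads = np.random.randn(3, 2)
alpha, theta = grpo_step(losses, alpha, theta, grads, eta_alpha=0.5, eta_theta=0.1)
```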
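The 80/10/10 protocol quoted in the Dataset Splits row can be reproduced with a simple shuffled index split. The helper below (`split_80_10_10`) is a hypothetical sketch; the linked repositories may shuffle, seed, or stratify the data differently.

```python
import numpy as np

def split_80_10_10(n_examples, seed=0):
    """Return train/val/test index arrays under an 80/10/10 split (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(0.8 * n_examples)
    n_val = int(0.1 * n_examples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```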