Robust Reinforcement Learning from Corrupted Human Feedback
Authors: Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on robotic control and natural language generation with large language models (LLMs) show that R3M improves robustness of the reward against several types of perturbations to the preference data. |
| Researcher Affiliation | Collaboration | 1Georgia Tech, 2Amazon. |
| Pseudocode | No | The paper describes its alternating optimization algorithm in text but does not present it as formal pseudocode or an algorithm block (a hedged sketch of such a loop is given below the table). |
| Open Source Code | No | We will release the code after the submission deadline. |
| Open Datasets | Yes | In summarization... we use the human preferences gathered by Stiennon et al. [37] for preference optimization. In single-turn dialogue... We use the Anthropic Helpful and Harmless (HH) dialogue preferences dataset [3]... |
| Dataset Splits | Yes | We randomly select 800 samples from its testing split to calculate the win rate, and use the rest of the data in the testing split for validation during preference optimization. |
| Hardware Specification | Yes | We conduct our experiment using eight A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | Our implementations of robotic control tasks are based on the Stable-Baselines3 library [31] and the RL Zoo training framework [30]. For natural language generation tasks, our implementations are based on transformers [45] and trl training framework [43]. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For R3M and the baseline (cross-entropy loss), we tune the number of epochs in {1, 3, 5} and the batch size in {8, 16, 64}. We use Adam optimizer [20] and tune the learning rate in {1e-2, 5e-3, 1e-3} for the Ant and Half Cheetah, and set the learning rate to 1e-2 for the Hopper. For R3M, we tune the λ in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. (The full search space is restated in code below the table.) |
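
The Pseudocode row notes that the alternating optimization is described only in prose. The sketch below is a minimal, hedged illustration of what such a loop could look like, assuming the objective augments a Bradley–Terry preference loss with per-sample perturbation variables under an ℓ1 penalty weighted by the λ from the Experiment Setup row. The exact objective, the update rule for the perturbations, and all names (`r3m_style_loss`, `train_epoch`, the loader format) are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def r3m_style_loss(reward_chosen, reward_rejected, delta, lam):
    """Bradley-Terry style negative log-likelihood with per-sample
    perturbations `delta` and an l1 penalty weighted by `lam`.
    Illustrative only; the paper's exact objective may differ."""
    margin = reward_chosen - reward_rejected + delta
    nll = F.softplus(-margin)          # equals -log(sigmoid(margin))
    return (nll + lam * delta.abs()).mean()

def train_epoch(reward_model, loader, opt_reward, deltas, lam,
                delta_lr=0.1, delta_steps=5):
    """One epoch of alternating optimization: update the perturbations
    with the reward model frozen, then update the reward model with the
    perturbations frozen. `loader` is assumed to yield (chosen, rejected,
    idx) batches, and `deltas` is a buffer with one entry per sample."""
    for chosen, rejected, idx in loader:
        r_c = reward_model(chosen)     # scalar reward per chosen response
        r_r = reward_model(rejected)   # scalar reward per rejected response

        # Step 1: update per-sample perturbations (reward model frozen).
        delta = deltas[idx].clone().requires_grad_(True)
        for _ in range(delta_steps):
            loss_d = r3m_style_loss(r_c.detach(), r_r.detach(), delta, lam)
            (grad,) = torch.autograd.grad(loss_d, delta)
            delta = (delta - delta_lr * grad).detach().requires_grad_(True)
        deltas[idx] = delta.detach()

        # Step 2: update the reward model (perturbations frozen).
        loss_r = r3m_style_loss(r_c, r_r, deltas[idx], lam)
        opt_reward.zero_grad()
        loss_r.backward()
        opt_reward.step()
```

The plain gradient steps on `delta` merely stand in for whatever closed-form or proximal update the paper actually prescribes for the ℓ1-penalized variables.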
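
As a compact view of the Experiment Setup row, the hypothetical dictionary below just restates the quoted search spaces for the robotic-control tasks; the variable names and layout are ours, not taken from the authors' configuration files.

```python
from itertools import product

# Search spaces quoted in the Experiment Setup row (robotic control).
search_space = {
    "epochs": [1, 3, 5],
    "batch_size": [8, 16, 64],
    "lambda_": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],  # R3M only
}
learning_rates = {                       # Adam optimizer
    "ant": [1e-2, 5e-3, 1e-3],           # tuned
    "half_cheetah": [1e-2, 5e-3, 1e-3],  # tuned
    "hopper": [1e-2],                    # fixed
}

# Example: enumerate all R3M configurations for the Ant task.
ant_grid = list(product(search_space["epochs"],
                        search_space["batch_size"],
                        learning_rates["ant"],
                        search_space["lambda_"]))
print(len(ant_grid))  # 3 * 3 * 3 * 9 = 243 configurations
```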