Robust Reinforcement Learning from Corrupted Human Feedback

Authors: Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on robotic control and natural language generation with large language models (LLMs) show that R3M improves robustness of the reward against several types of perturbations to the preference data.
Researcher Affiliation | Collaboration | Georgia Tech, Amazon.
Pseudocode | No | The paper describes its alternating optimization algorithm in text but does not present it as formal pseudocode or an algorithm block (a hedged sketch follows this table).
Open Source Code | No | We will release the code after the submission deadline.
Open Datasets | Yes | In summarization... we use the human preferences gathered by Stiennon et al. [37] for preference optimization. In single-turn dialogue... We use the Anthropic Helpful and Harmless (HH) dialogue preferences dataset [3]...
Dataset Splits | Yes | We randomly select 800 samples from its testing split to calculate the win rate, and use the rest of the data in the testing split for validation during preference optimization.
Hardware Specification | Yes | We conduct our experiment using eight A100 GPUs, each with 40GB of memory.
Software Dependencies | No | Our implementations of robotic control tasks are based on the Stable-Baselines3 library [31] and the RL Zoo training framework [30]. For natural language generation tasks, our implementations are based on transformers [45] and trl training framework [43]. However, specific version numbers for these software components are not provided.
Experiment Setup | Yes | For R3M and the baseline (cross-entropy loss), we tune the number of epochs in {1, 3, 5} and the batch size in {8, 16, 64}. We use Adam optimizer [20] and tune the learning rate in {1e-2, 5e-3, 1e-3} for the Ant and Half Cheetah, and set the learning rate to 1e-2 for the Hopper. For R3M, we tune the λ in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.
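
Regarding the Pseudocode row: the paper describes the alternating optimization only in prose, so the following is a minimal, hypothetical sketch of how such a scheme could look. It assumes a Bradley-Terry cross-entropy loss perturbed by per-sample variables `delta` under an ℓ1 penalty weighted by λ, with a soft-thresholding update for `delta`; the objective, update rule, and function names are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of one alternating-optimization round for robust reward
# learning in the spirit of R3M. The perturbed Bradley-Terry loss, the l1
# penalty `lam`, and the soft-thresholding update for `delta` are assumptions
# inferred from the paper's textual description, not the authors' code.
import torch
import torch.nn.functional as F

def alternating_step(reward_model, optimizer, batch, delta, lam, delta_step=1.0):
    """One alternating round: update reward parameters, then update delta.

    reward_model: maps a batch of responses to scalar rewards, shape (B,).
    delta: tensor of shape (B,) holding the perturbation for each sample in
           this batch (a full implementation would index delta by dataset id).
    """
    # Step 1: fix delta, take a gradient step on the perturbed preference loss.
    margin = reward_model(batch["chosen"]) - reward_model(batch["rejected"])
    loss = -F.logsigmoid(margin + delta).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Step 2: fix the reward model, update delta with a proximal gradient
    # (soft-thresholding) step on  -logsigmoid(margin + delta) + lam * |delta|.
    with torch.no_grad():
        margin = reward_model(batch["chosen"]) - reward_model(batch["rejected"])
        grad = -torch.sigmoid(-(margin + delta))  # gradient of -logsigmoid w.r.t. delta
        z = delta - delta_step * grad
        delta = torch.sign(z) * torch.clamp(z.abs() - delta_step * lam, min=0.0)

    return loss.item(), delta
```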
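
For the Open Datasets and Dataset Splits rows, a short sketch of how the two preference datasets could be loaded and the 800-sample evaluation subset carved out of a test split (shown for the HH dataset as an example). The Hugging Face dataset identifiers and the shuffle seed are assumptions; the paper only says the 800 samples are randomly selected.

```python
# Sketch of loading the preference datasets and carving out the evaluation
# split described in the paper. The hub IDs and seed are assumptions.
from datasets import load_dataset

# Summarization preferences (Stiennon et al.) and Anthropic HH dialogue preferences.
summ_prefs = load_dataset("openai/summarize_from_feedback", "comparisons")
hh_prefs = load_dataset("Anthropic/hh-rlhf")

test = hh_prefs["test"].shuffle(seed=0)          # seed is a placeholder
winrate_eval = test.select(range(800))           # 800 samples for win-rate evaluation
validation = test.select(range(800, len(test)))  # remainder for validation
```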
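
Since the Software Dependencies row flags that version numbers are not provided, a reproduction could at least record the versions it runs with; the snippet below does only that. The package list mirrors the libraries named in the paper (PyPI distribution names are assumed).

```python
# Record installed versions of the libraries the paper names, since the paper
# itself does not pin them. Package names follow their PyPI distributions.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["stable-baselines3", "rl_zoo3", "transformers", "trl", "torch"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```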
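
The Experiment Setup row quotes the tuning grids; one compact way to express that sweep is sketched below. Only the grid values come from the quoted text; the dictionary layout, the exhaustive product sweep, and the training entry point are assumptions.

```python
# Hyperparameter grids quoted in the paper's experiment setup, expressed as a
# simple exhaustive sweep. Only the values are from the paper.
from itertools import product

grid = {
    "epochs": [1, 3, 5],
    "batch_size": [8, 16, 64],
    # Tuned for Ant and Half Cheetah; fixed to 1e-2 for Hopper.
    "learning_rate": [1e-2, 5e-3, 1e-3],
    # R3M-specific penalty weight lambda.
    "lam": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
}

for epochs, bs, lr, lam in product(
    grid["epochs"], grid["batch_size"], grid["learning_rate"], grid["lam"]
):
    run_config = dict(epochs=epochs, batch_size=bs, learning_rate=lr, lam=lam)
    print(run_config)  # placeholder for launching a training run with this config
```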