Robust Reinforcement Learning from Corrupted Human Feedback
Authors: Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on robotic control and natural language generation with large language models (LLMs) show that R3M improves robustness of the reward against several types of perturbations to the preference data. |
| Researcher Affiliation | Collaboration | 1Georgia Tech, 2Amazon. |
| Pseudocode | No | The paper describes its alternating optimization algorithm in text but does not present it as formal pseudocode or an algorithm block (a hedged sketch of such a loop is given below the table). |
| Open Source Code | No | We will release the code after the submission deadline. |
| Open Datasets | Yes | In summarization... we use the human preferences gathered by Stiennon et al. [37] for preference optimization. In single-turn dialogue... We use the Anthropic Helpful and Harmless (HH) dialogue preferences dataset [3]... |
| Dataset Splits | Yes | We randomly select 800 samples from its testing split to calculate the win rate, and use the rest of the data in the testing split for validation during preference optimization. |
| Hardware Specification | Yes | We conduct our experiment using eight A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | Our implementations of robotic control tasks are based on the Stable-Baselines3 library [31] and the RL Zoo training framework [30]. For natural language generation tasks, our implementations are based on transformers [45] and trl training framework [43]. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For R3M and the baseline (cross-entropy loss), we tune the number of epochs in {1, 3, 5} and the batch size in {8, 16, 64}. We use Adam optimizer [20] and tune the learning rate in {1e-2, 5e-3, 1e-3} for the Ant and Half Cheetah, and set the learning rate to 1e-2 for the Hopper. For R3M, we tune the λ in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. (The full search space is restated in code below the table.) |
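
The Pseudocode row notes that the alternating optimization is described only in prose. The sketch below is a minimal, hedged illustration of what such a loop could look like, assuming the objective augments a Bradley–Terry preference loss with per-sample perturbation variables under an ℓ1 penalty weighted by the λ from the Experiment Setup row. The exact objective, the update rule for the perturbations, and all names (`r3m_style_loss`, `train_epoch`, the loader format) are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def r3m_style_loss(reward_chosen, reward_rejected, delta, lam):
    """Bradley-Terry style negative log-likelihood with per-sample
    perturbations `delta` and an l1 penalty weighted by `lam`.
    Illustrative only; the paper's exact objective may differ."""
    margin = reward_chosen - reward_rejected + delta
    nll = F.softplus(-margin)          # equals -log(sigmoid(margin))
    return (nll + lam * delta.abs()).mean()

def train_epoch(reward_model, loader, opt_reward, deltas, lam,
                delta_lr=0.1, delta_steps=5):
    """One epoch of alternating optimization: update the perturbations
    with the reward model frozen, then update the reward model with the
    perturbations frozen. `loader` is assumed to yield (chosen, rejected,
    idx) batches, and `deltas` is a buffer with one entry per sample."""
    for chosen, rejected, idx in loader:
        r_c = reward_model(chosen)     # scalar reward per chosen response
        r_r = reward_model(rejected)   # scalar reward per rejected response

        # Step 1: update per-sample perturbations (reward model frozen).
        delta = deltas[idx].clone().requires_grad_(True)
        for _ in range(delta_steps):
            loss_d = r3m_style_loss(r_c.detach(), r_r.detach(), delta, lam)
            (grad,) = torch.autograd.grad(loss_d, delta)
            delta = (delta - delta_lr * grad).detach().requires_grad_(True)
        deltas[idx] = delta.detach()

        # Step 2: update the reward model (perturbations frozen).
        loss_r = r3m_style_loss(r_c, r_r, deltas[idx], lam)
        opt_reward.zero_grad()
        loss_r.backward()
        opt_reward.step()
```

The plain gradient steps on `delta` merely stand in for whatever closed-form or proximal update the paper actually prescribes for the ℓ1-penalized variables.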
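
As a compact view of the Experiment Setup row, the hypothetical dictionary below just restates the quoted search spaces for the robotic-control tasks; the variable names and layout are ours, not taken from the authors' configuration files.

```python
from itertools import product

# Search spaces quoted in the Experiment Setup row (robotic control).
search_space = {
    "epochs": [1, 3, 5],
    "batch_size": [8, 16, 64],
    "lambda_": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],  # R3M only
}
learning_rates = {                       # Adam optimizer
    "ant": [1e-2, 5e-3, 1e-3],           # tuned
    "half_cheetah": [1e-2, 5e-3, 1e-3],  # tuned
    "hopper": [1e-2],                    # fixed
}

# Example: enumerate all R3M configurations for the Ant task.
ant_grid = list(product(search_space["epochs"],
                        search_space["batch_size"],
                        learning_rates["ant"],
                        search_space["lambda_"]))
print(len(ant_grid))  # 3 * 3 * 3 * 9 = 243 configurations
```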