Reinforcement Learning from Diverse Human Preferences
Authors: Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method on a variety of complex locomotion and robotic manipulation tasks (see Fig. 2) from Deep Mind Control Suite (DMControl) [Tassa et al., 2018; Tunyasuvunakool et al., 2020] and Meta-world [Yu et al., 2020]. The results show that our method is able to effectively recover the performance of existing preference-based RL algorithms under diverse preferences in all the tasks. |
| Researcher Affiliation | Collaboration | Wanqi Xue1 , Bo An1,2 , Shuicheng Yan2 and Zhongwen Xu3 1Nanyang Technological University 2Skywork AI, Singapore 3Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1 RL from Diverse Human Preferences |
| Open Source Code | No | The paper does not provide any specific links to source code or explicit statements about its public availability. |
| Open Datasets | Yes | We evaluate our method on several complex locomotion tasks and robotic manipulation tasks from Deep Mind Control Suite (DMControl) [Tassa et al., 2018; Tunyasuvunakool et al., 2020] and Meta-world [Yu et al., 2020], respectively (see Fig. 2). |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, such as specific percentages, sample counts, or splitting methodology. It mentions 'training' and 'evaluating' but not the specific data partitioning. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | For all tasks, we use the same hyperparameters used by PEBBLE, such as learning rates, architectures of the neural networks, and reward model updating frequency. We adopt a uniform sampling strategy, which selects queries with the same probability. At each feedback session, a batch of 256 trajectory segments (σ0, σ1) is sampled for annotation. The strength of constraint ϕ is set as 100 for all tasks. |
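
The Pseudocode and Experiment Setup rows refer to a PEBBLE-style preference-based RL pipeline with uniform query sampling and batches of 256 segment pairs. The sketch below is a minimal illustration of the standard Bradley-Terry reward-learning step that such pipelines build on; it is not the authors' released code, and all class, function, and variable names are hypothetical.

```python
# Illustrative sketch (not the authors' code) of Bradley-Terry reward learning
# as used in PEBBLE-style preference-based RL, with uniform query sampling and
# a batch of 256 segment pairs as quoted in the Experiment Setup row.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """MLP mapping a (state, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def preference_loss(reward_model, seg0, seg1, labels):
    """Bradley-Terry cross-entropy over a batch of segment pairs.

    seg0, seg1: dicts with 'obs' and 'act' tensors of shape
    (batch, segment_len, dim); labels: (batch,) with 1 if the annotator
    preferred seg1 over seg0, else 0.
    """
    # Sum predicted per-step rewards over each segment to get segment returns.
    r0 = reward_model(seg0["obs"], seg0["act"]).sum(dim=1).squeeze(-1)
    r1 = reward_model(seg1["obs"], seg1["act"]).sum(dim=1).squeeze(-1)
    # P(seg1 preferred) = exp(r1) / (exp(r0) + exp(r1)) = sigmoid(r1 - r0).
    logits = r1 - r0
    return nn.functional.binary_cross_entropy_with_logits(logits, labels.float())


def sample_uniform_queries(buffer_size: int, batch_size: int = 256) -> torch.Tensor:
    """Uniform sampling: pick segment-pair indices with equal probability."""
    return torch.randint(0, buffer_size, (batch_size, 2))
```

The paper's contribution modifies how diverse annotators' labels are aggregated (with a constraint of strength ϕ = 100); that mechanism is not reproduced here since the source code is not public.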
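
For the Open Datasets row, both benchmarks are publicly available Python packages. The snippet below is a hedged loading example only: the task names are illustrative rather than confirmed as those used in the paper, and the Meta-world API differs between releases (this follows the ML1 interface from the original Meta-world release).

```python
# Illustrative loading of the two benchmarks named in the Open Datasets row.
import random

from dm_control import suite  # DeepMind Control Suite (DMControl)
import metaworld              # Meta-world

# A DMControl locomotion task (example task; may differ from the paper's).
dmc_env = suite.load(domain_name="walker", task_name="walk")
timestep = dmc_env.reset()

# A Meta-world robotic manipulation task via the ML1 benchmark.
ml1 = metaworld.ML1("door-open-v2")
mw_env = ml1.train_classes["door-open-v2"]()
mw_env.set_task(random.choice(ml1.train_tasks))
obs = mw_env.reset()
```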