Reinforcement Learning from Diverse Human Preferences

Authors: Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method on a variety of complex locomotion and robotic manipulation tasks (see Fig. 2) from DeepMind Control Suite (DMControl) [Tassa et al., 2018; Tunyasuvunakool et al., 2020] and Meta-world [Yu et al., 2020]. The results show that our method is able to effectively recover the performance of existing preference-based RL algorithms under diverse preferences in all the tasks.
Researcher Affiliation | Collaboration | Wanqi Xue (Nanyang Technological University), Bo An (Nanyang Technological University; Skywork AI, Singapore), Shuicheng Yan (Skywork AI, Singapore), Zhongwen Xu (Tencent AI Lab)
Pseudocode | Yes | Algorithm 1: RL from Diverse Human Preferences (a minimal reward-learning sketch follows the table)
Open Source Code | No | The paper does not provide any specific links to source code or explicit statements about its public availability.
Open Datasets | Yes | We evaluate our method on several complex locomotion tasks and robotic manipulation tasks from DeepMind Control Suite (DMControl) [Tassa et al., 2018; Tunyasuvunakool et al., 2020] and Meta-world [Yu et al., 2020], respectively (see Fig. 2). (An environment-loading sketch follows the table.)
Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, such as specific percentages, sample counts, or splitting methodology. It mentions 'training' and 'evaluating' but not the specific data partitioning.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | For all tasks, we use the same hyperparameters used by PEBBLE, such as learning rates, architectures of the neural networks, and reward-model updating frequency. We adopt a uniform sampling strategy, which selects queries with equal probability. At each feedback session, a batch of 256 trajectory segments (σ0, σ1) is sampled for annotation. The strength of the constraint ϕ is set to 100 for all tasks. (A configuration sketch follows the table.)
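
The Algorithm 1 referenced in the Pseudocode row builds on PEBBLE-style preference-based reward learning. As background only, the sketch below shows the standard Bradley-Terry cross-entropy objective such methods use to fit a reward model to annotated segment pairs (σ0, σ1); the RewardModel architecture, tensor shapes, and function names are illustrative assumptions, and the paper's additional machinery for reconciling diverse annotators is not reproduced here.

```python
# Minimal sketch of Bradley-Terry preference learning (PEBBLE-style);
# not the paper's full Algorithm 1. All names here are illustrative.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps (state, action) pairs to scalar reward estimates."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward_net, seg0, seg1, labels):
    """Bradley-Terry cross-entropy loss for one batch of preference queries.

    seg0, seg1: (obs, act) tensor pairs, each of shape (batch, segment_len, dim),
                for the two segments in each query.
    labels:     (batch,) float tensor; 1.0 if segment 1 is preferred, else 0.0.
    """
    ret0 = reward_net(*seg0).sum(dim=1)  # predicted return of segment 0
    ret1 = reward_net(*seg1).sum(dim=1)  # predicted return of segment 1
    # P[segment 1 preferred] = exp(ret1) / (exp(ret0) + exp(ret1))
    logits = ret1 - ret0
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```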
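
The benchmarks cited in the Open Datasets row are simulated control suites rather than static datasets. Below is a minimal sketch of instantiating a DMControl locomotion task with the dm_control package; the specific walker/walk task and the random-action loop are assumptions for illustration, and Meta-world manipulation tasks are set up analogously through the metaworld package.

```python
# Minimal sketch: instantiating a DMControl locomotion task.
# The specific task ("walker", "walk") is an illustrative choice.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Random actions stand in for the learned policy.
    action = np.random.uniform(action_spec.minimum,
                               action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
```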
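
To make the Experiment Setup row concrete, the sketch below gathers the reported choices (PEBBLE defaults, uniform query sampling, 256 segment pairs per feedback session, constraint strength ϕ = 100) into a hypothetical configuration and a uniform query sampler. The dictionary keys and the sample_queries helper are assumptions; how ϕ enters the paper's training objective is not shown.

```python
import random

# Hypothetical configuration mirroring the reported setup; key names are
# illustrative, values follow the paper's description.
config = {
    "base_algorithm": "PEBBLE",    # PEBBLE hyperparameters reused unchanged
    "query_sampling": "uniform",   # queries selected with equal probability
    "queries_per_session": 256,    # segment pairs annotated per feedback session
    "constraint_strength": 100.0,  # phi, weight of the constraint term
}


def sample_queries(segment_pool, num_queries=config["queries_per_session"]):
    """Uniform query sampling: every candidate segment pair is equally likely."""
    return [tuple(random.sample(segment_pool, 2)) for _ in range(num_queries)]
```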