Query-Policy Misalignment in Preference-Based Reinforcement Learning

Authors: Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks. In this section, we present extensive evaluations on 6 locomotion tasks in DMControl (Tassa et al., 2018) and 3 robotic manipulation tasks in Meta-World (Yu et al., 2020).
Researcher Affiliation | Academia | Xiao Hu (1), Jianxiong Li (1), Xianyuan Zhan (1,2), Qing-Shan Jia (1) & Ya-Qin Zhang (1); (1) Tsinghua University, Beijing, China; (2) Shanghai Artificial Intelligence Laboratory, Shanghai, China
Pseudocode | Yes | We provide the procedure of QPA in Algorithm 1. QPA is compatible with existing off-policy PbRL methods, with only 20 lines of code modifications on top of the widely-adopted PbRL backbone framework B-Pref (Lee et al., 2021a).
Open Source Code | Yes | Code is available at https://github.com/huxiao09/QPA.
Open Datasets | Yes | We evaluate our proposed method on benchmark environments in DeepMind Control Suite (DMControl) (Tassa et al., 2018) and Meta-World (Yu et al., 2020).
Dataset Splits | No | The paper describes its experimental evaluation using terms like "training process" and "evaluations on locomotion tasks" and mentions a "replay buffer D," but it does not specify explicit training/validation/test splits with percentages, counts, or citations to predefined splits that would be required for reproduction.
Hardware Specification | No | The paper does not report the hardware used to run its experiments, such as specific GPU/CPU models or memory capacity.
Software Dependencies | No | The paper mentions software components like "PyTorch" and "SAC (Soft Actor-Critic)" as well as the "B-Pref" framework, but it does not provide specific version numbers for these or any other ancillary software components required for replication.
Experiment Setup | Yes | QPA, PEBBLE, and SURF all employ SAC (Soft Actor-Critic) (Haarnoja et al., 2018) for policy learning and share the same hyperparameters of SAC. We provide the full list of hyperparameters of SAC in Table 1. Both QPA and SURF utilize PEBBLE as the off-policy PbRL backbone algorithm and share the same hyperparameters of PEBBLE as listed in Table 2. The additional hyperparameters of SURF based on PEBBLE are set according to their paper and are listed in Table 3. The additional hyperparameters of QPA are presented in Table 4.
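
The "Pseudocode" and "Experiment Setup" rows above note that QPA is implemented as a small modification of PEBBLE within the B-Pref framework, centered on aligning preference queries with the current policy. The Python sketch below is a minimal, hypothetical illustration of that kind of change (querying from recent, near on-policy rollouts and mixing recent experience into policy-update batches), not the authors' released code; the buffer layout, function names, and the recent-window size are assumptions.

    import random
    from collections import deque

    # Hypothetical sketch of a PEBBLE-style loop modified along the lines QPA
    # describes. All names and values here are illustrative assumptions.

    RECENT_CAPACITY = 10_000              # assumed size of the "recent rollout" window
    recent_buffer = deque(maxlen=RECENT_CAPACITY)

    def store_transition(transition, replay_buffer):
        """Keep every transition in both the full replay buffer and the recent window."""
        replay_buffer.append(transition)
        recent_buffer.append(transition)

    def sample_query_segments(num_pairs, segment_len):
        """Draw candidate segment pairs for preference queries from recent rollouts only,
        so the labels target behavior close to the current policy."""
        data = list(recent_buffer)
        assert len(data) > segment_len, "not enough recent data to form a segment"
        pairs = []
        for _ in range(num_pairs):
            i = random.randrange(0, len(data) - segment_len)
            j = random.randrange(0, len(data) - segment_len)
            pairs.append((data[i:i + segment_len], data[j:j + segment_len]))
        return pairs

    def sample_sac_batch(replay_buffer, batch_size, recent_ratio=0.5):
        """Mix uniformly sampled replay transitions with transitions from the
        recent window when updating the SAC agent."""
        n_recent = int(batch_size * recent_ratio)
        batch = random.sample(list(recent_buffer), n_recent)
        batch += random.sample(list(replay_buffer), batch_size - n_recent)
        return batch

The actual QPA procedure, including its exact sampling scheme and hyperparameters, is specified in Algorithm 1 and Tables 1-4 of the paper and in the released repository linked above.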