Query-Policy Misalignment in Preference-Based Reinforcement Learning

Authors: Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks. In this section, we present extensive evaluations on 6 locomotion tasks in DMControl (Tassa et al., 2018) and 3 robotic manipulation tasks in Meta-World (Yu et al., 2020).
Researcher Affiliation | Academia | Xiao Hu (1), Jianxiong Li (1), Xianyuan Zhan (1,2), Qing-Shan Jia (1) & Ya-Qin Zhang (1); (1) Tsinghua University, Beijing, China; (2) Shanghai Artificial Intelligence Laboratory, Shanghai, China
Pseudocode | Yes | We provide the procedure of QPA in Algorithm 1. QPA is compatible with existing off-policy PbRL methods, with only 20 lines of code modifications on top of the widely-adopted PbRL backbone framework B-Pref (Lee et al., 2021a).
Open Source Code | Yes | Code is available at https://github.com/huxiao09/QPA.
Open Datasets | Yes | We evaluate our proposed method on benchmark environments in DeepMind Control Suite (DMControl) (Tassa et al., 2018) and Meta-World (Yu et al., 2020).
Dataset Splits | No | The paper describes its experimental evaluation using terms like "training process" and "evaluations on locomotion tasks" and mentions a "replay buffer D," but it does not specify explicit training/validation/test splits with percentages, counts, or citations to predefined splits that would be required for reproduction.
Hardware Specification | No | The paper does not report the hardware used to run its experiments, such as specific GPU/CPU models or memory capacity.
Software Dependencies | No | The paper mentions software components like "PyTorch" and "SAC (Soft Actor-Critic)" as well as the "B-Pref" framework, but it does not provide specific version numbers for these or any other ancillary software components required for replication.
Experiment Setup | Yes | QPA, PEBBLE, and SURF all employ SAC (Soft Actor-Critic) (Haarnoja et al., 2018) for policy learning and share the same hyperparameters of SAC. We provide the full list of hyperparameters of SAC in Table 1. Both QPA and SURF utilize PEBBLE as the off-policy PbRL backbone algorithm and share the same hyperparameters of PEBBLE as listed in Table 2. The additional hyperparameters of SURF based on PEBBLE are set according to their paper and are listed in Table 3. The additional hyperparameters of QPA are presented in Table 4.
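
The "Pseudocode" and "Experiment Setup" rows above note that QPA is implemented as a small modification of PEBBLE within the B-Pref framework, centered on aligning preference queries with the current policy. The Python sketch below is a minimal, hypothetical illustration of that kind of change (querying from recent, near on-policy rollouts and mixing recent experience into policy-update batches), not the authors' released code; the buffer layout, function names, and the recent-window size are assumptions.

    import random
    from collections import deque

    # Hypothetical sketch of a PEBBLE-style loop modified along the lines QPA
    # describes. All names and values here are illustrative assumptions.

    RECENT_CAPACITY = 10_000              # assumed size of the "recent rollout" window
    recent_buffer = deque(maxlen=RECENT_CAPACITY)

    def store_transition(transition, replay_buffer):
        """Keep every transition in both the full replay buffer and the recent window."""
        replay_buffer.append(transition)
        recent_buffer.append(transition)

    def sample_query_segments(num_pairs, segment_len):
        """Draw candidate segment pairs for preference queries from recent rollouts only,
        so the labels target behavior close to the current policy."""
        data = list(recent_buffer)
        assert len(data) > segment_len, "not enough recent data to form a segment"
        pairs = []
        for _ in range(num_pairs):
            i = random.randrange(0, len(data) - segment_len)
            j = random.randrange(0, len(data) - segment_len)
            pairs.append((data[i:i + segment_len], data[j:j + segment_len]))
        return pairs

    def sample_sac_batch(replay_buffer, batch_size, recent_ratio=0.5):
        """Mix uniformly sampled replay transitions with transitions from the
        recent window when updating the SAC agent."""
        n_recent = int(batch_size * recent_ratio)
        batch = random.sample(list(recent_buffer), n_recent)
        batch += random.sample(list(replay_buffer), batch_size - n_recent)
        return batch

The actual QPA procedure, including its exact sampling scheme and hyperparameters, is specified in Algorithm 1 and Tables 1-4 of the paper and in the released repository linked above.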