Query-Policy Misalignment in Preference-Based Reinforcement Learning
Authors: Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks. In this section, we present extensive evaluations on 6 locomotion tasks in DMControl (Tassa et al., 2018) and 3 robotic manipulation tasks in Meta-World (Yu et al., 2020). |
| Researcher Affiliation | Academia | Xiao Hu (1), Jianxiong Li (1), Xianyuan Zhan (1,2), Qing-Shan Jia (1) & Ya-Qin Zhang (1); (1) Tsinghua University, Beijing, China; (2) Shanghai Artificial Intelligence Laboratory, Shanghai, China |
| Pseudocode | Yes | We provide the procedure of QPA in Algorithm 1. QPA is compatible with existing off-policy PbRL methods, with only 20 lines of code modifications on top of the widely-adopted PbRL backbone framework B-Pref (Lee et al., 2021a). |
| Open Source Code | Yes | Code is available at https://github.com/huxiao09/QPA. |
| Open Datasets | Yes | We evaluate our proposed method on benchmark environments in DeepMind Control Suite (DMControl) (Tassa et al., 2018) and Meta-World (Yu et al., 2020). |
| Dataset Splits | No | The paper describes its experimental evaluation using terms like "training process" and "evaluations on locomotion tasks" and mentions a replay buffer D, but it does not specify explicit training/validation/test splits (percentages, counts, or citations to predefined splits) needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like "PyTorch" and "SAC (Soft Actor-Critic)" as well as the "B-Pref" framework, but it does not provide specific version numbers for these or any other ancillary software components required for replication. |
| Experiment Setup | Yes | QPA, PEBBLE, and SURF all employ SAC (Soft Actor-Critic) (Haarnoja et al., 2018) for policy learning and share the same hyperparameters of SAC. We provide the full list of hyperparameters of SAC in Table 1. Both QPA and SURF utilize PEBBLE as the off-policy Pb RL backbone algorithm and share the same hyperparameters of PEBBLE as listed in Table 2. The additional hyperparameters of SURF based on PEBBLE are set according to their paper and are listed in Table 3. The additional hyperparameters of QPA are presented in Table 4. |
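
The Pseudocode row notes that QPA adds only about 20 lines of code on top of the B-Pref/PEBBLE backbone, with the paper's central concern being that preference queries should stay aligned with the current policy. The snippet below is a minimal, hypothetical sketch of that general idea (drawing query segments from recent, near on-policy episodes rather than uniformly from the full replay buffer); all class and function names are illustrative and are not taken from the QPA repository.

```python
import random
from collections import deque

# Rough sketch of policy-aligned query selection: preference queries are
# drawn only from segments produced by recent policies, instead of
# uniformly from the entire replay buffer. Names are illustrative only.

class RecentSegmentBuffer:
    """Holds trajectory segments from only the most recent episodes."""

    def __init__(self, max_recent_segments=100):
        self.segments = deque(maxlen=max_recent_segments)  # old segments fall out

    def add(self, segment):
        self.segments.append(segment)

    def sample_query_pair(self):
        # Pick two distinct near-on-policy segments to show to the teacher.
        return random.sample(list(self.segments), 2)


def ask_preference(segment_a, segment_b):
    """Placeholder for the human (or scripted) teacher; returns 0 or 1."""
    raise NotImplementedError


def collect_feedback(buffer, num_queries):
    # Gather labelled preference pairs for reward-model training.
    feedback = []
    for _ in range(num_queries):
        seg_a, seg_b = buffer.sample_query_pair()
        feedback.append((seg_a, seg_b, ask_preference(seg_a, seg_b)))
    return feedback
```

In a full off-policy PbRL loop (as in PEBBLE), this query buffer would sit alongside the regular SAC replay buffer, with the reward model retrained on the collected feedback and used to relabel stored transitions.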
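
The Experiment Setup row refers to hyperparameter Tables 1-4, which are not reproduced here. As a sketch of how that layered setup (shared SAC settings, shared PEBBLE backbone settings, method-specific additions) is often organised in code, one might use something like the following; every value and the QPA-specific field name are placeholders, not the paper's actual settings.

```python
from dataclasses import dataclass, field

# Illustrative config layering only, mirroring the Table 1-4 structure the
# paper describes. All values below are placeholders, NOT from the paper.

@dataclass
class SACConfig:                     # shared by QPA, PEBBLE, and SURF
    learning_rate: float = 3e-4      # placeholder
    discount: float = 0.99           # placeholder
    batch_size: int = 1024           # placeholder

@dataclass
class PEBBLEConfig:                  # shared off-policy PbRL backbone settings
    segment_length: int = 50         # placeholder
    total_feedback: int = 1000       # placeholder

@dataclass
class QPAConfig:                     # backbone settings plus method-specific extras
    sac: SACConfig = field(default_factory=SACConfig)
    pebble: PEBBLEConfig = field(default_factory=PEBBLEConfig)
    recent_episode_window: int = 10  # hypothetical QPA-specific knob
```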