Reward-Constrained Behavior Cloning

Authors: Zhaorong Wang, Meng Wang, Jingqi Zhang, Yingfeng Chen, Chongjie Zhang

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we aim at investigating (i) whether RCBC can learn from demonstration and avoid undesirable behaviors, and (ii) whether RCBC can obtain high returns while preserving the human-like style. To evaluate our method, we conduct extensive experiments on several widely used environments: a navigation task in Grid-World, two environments in MuJoCo [Todorov et al., 2012], and finally, a more complex racing environment named TORCS [Wymann et al., 2021].
Researcher Affiliation | Collaboration | 1NetEase Fuxi AI Lab, Hangzhou, China 2School of Computer Science and Technology, Xi'an Jiaotong University 3MMW, Tsinghua University {wangzhaorong, wangmeng02, chenyingfeng1}@corp.netease.com, zkkomodao@gmail.com, chongjie@tsinghua.edu.cn
Pseudocode | Yes | The pseudo-code for Constrain2 is provided in Appendix B.
Open Source Code | No | The paper does not provide a specific link to the source code for the methodology described in the paper. It links to the full text and a video, but not code.
Open Datasets | Yes | To evaluate our method, we conduct extensive experiments on several widely used environments: a navigation task in Grid-World, two environments in MuJoCo [Todorov et al., 2012], and finally, a more complex racing environment named TORCS [Wymann et al., 2021].
Dataset Splits | No | The paper describes generating demonstration trajectories (e.g., 'we provide one demonstration trajectory'). It does not specify standard training/validation/test splits with percentages or counts for a dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other system specifications used for running the experiments.
Software Dependencies | No | The paper mentions software like 'PPO', 'PPO+GAIL', and 'Constrain2' but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | For the second issue, we train the agent with three different threshold values. Here we reparameterize the reward constraint by α, which represents the ratio of the threshold to the max reward (100). We set α to 0.8, 0.6, 0.4 in the following experiments, corresponding to total rewards of 80, 60, 40, respectively. For the third issue, we load a pre-trained PPO model which can arrive at the optimal goal but passes through the top door, and continue optimizing the policy with RCBC to encourage the agent to reach the optimal goal through the bottom door. We also test three different α values (0.8, 0.6, 0.4) in this experiment according to Eq.(7).
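
The α parameterization quoted in the Experiment Setup row maps directly to absolute reward thresholds. Below is a minimal illustrative sketch of that mapping, assuming a maximum reward of 100 as stated in the excerpt; the function name and script are hypothetical and not taken from the paper's code.

# Minimal sketch (not the authors' implementation): the Experiment Setup excerpt
# reparameterizes the reward constraint as alpha = threshold / max_reward,
# with max_reward = 100. The helper below is illustrative only.

MAX_REWARD = 100.0

def reward_threshold(alpha: float, max_reward: float = MAX_REWARD) -> float:
    """Convert the ratio alpha back to an absolute reward-constraint threshold."""
    return alpha * max_reward

if __name__ == "__main__":
    for alpha in (0.8, 0.6, 0.4):
        # Reproduces the 80 / 60 / 40 total-reward thresholds quoted above.
        print(f"alpha = {alpha:.1f} -> threshold = {reward_threshold(alpha):.0f}")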