Reward-Constrained Behavior Cloning

Authors: Zhaorong Wang, Meng Wang, Jingqi Zhang, Yingfeng Chen, Chongjie Zhang

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we aim at investigating (i) whether RCBC can learn from demonstration and avoid undesirable behaviors, and (ii) whether RCBC can obtain high returns while preserving the human-like style. To evaluate our method, we conduct extensive experiments on several widely used environments: a navigation task in Grid-World, two environments in MuJoCo [Todorov et al., 2012], and finally, a more complex racing environment named TORCS [Wymann et al., 2021].
Researcher Affiliation | Collaboration | 1NetEase Fuxi AI Lab, Hangzhou, China 2School of Computer Science and Technology, Xi'an Jiaotong University 3MMW, Tsinghua University {wangzhaorong, wangmeng02, chenyingfeng1}@corp.netease.com, zkkomodao@gmail.com, chongjie@tsinghua.edu.cn
Pseudocode | Yes | The pseudo-code for Constrain2 is provided in Appendix B.
Open Source Code | No | The paper does not provide a specific link to the source code for the methodology described in the paper. It links to the full text and a video, but not code.
Open Datasets | Yes | To evaluate our method, we conduct extensive experiments on several widely used environments: a navigation task in Grid-World, two environments in MuJoCo [Todorov et al., 2012], and finally, a more complex racing environment named TORCS [Wymann et al., 2021].
Dataset Splits | No | The paper describes generating demonstration trajectories (e.g., 'we provide one demonstration trajectory'). It does not specify standard training/validation/test splits with percentages or counts for a dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other system specifications used for running the experiments.
Software Dependencies | No | The paper mentions software like 'PPO', 'PPO+GAIL', and 'Constrain2' but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | For the second issue, we train the agent with three different threshold values. Here we reparameterize the reward constraint by α, which represents the ratio of the threshold to the max reward (100). We set α to 0.8, 0.6, 0.4 in the following experiments, corresponding to total rewards of 80, 60, 40, respectively. For the third issue, we load a pre-trained PPO model which can arrive at the optimal goal but passes through the top door, and continue optimizing the policy with RCBC to encourage the agent to reach the optimal goal through the bottom door. We also test three different α values (0.8, 0.6, 0.4) in this experiment according to Eq.(7).
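
The α parameterization quoted in the Experiment Setup row maps directly to absolute reward thresholds. Below is a minimal illustrative sketch of that mapping, assuming a maximum reward of 100 as stated in the excerpt; the function name and script are hypothetical and not taken from the paper's code.

# Minimal sketch (not the authors' implementation): the Experiment Setup excerpt
# reparameterizes the reward constraint as alpha = threshold / max_reward,
# with max_reward = 100. The helper below is illustrative only.

MAX_REWARD = 100.0

def reward_threshold(alpha: float, max_reward: float = MAX_REWARD) -> float:
    """Convert the ratio alpha back to an absolute reward-constraint threshold."""
    return alpha * max_reward

if __name__ == "__main__":
    for alpha in (0.8, 0.6, 0.4):
        # Reproduces the 80 / 60 / 40 total-reward thresholds quoted above.
        print(f"alpha = {alpha:.1f} -> threshold = {reward_threshold(alpha):.0f}")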