Fighting Copycat Agents in Behavioral Cloning from Observation Histories

Authors: Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, Yang Gao

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we conduct experiments to evaluate our method against a variety of baselines. We also qualitatively study our method in order to better understand the newly introduced algorithm." (Section 6) and Table 2: "Cumulative rewards per episode in partially observed (PO) environments. The top half of the table shows results in our offline imitation setting."
Researcher Affiliation | Academia | Chuan Wen (1), Jierui Lin (2), Trevor Darrell (2), Dinesh Jayaraman (3), Yang Gao (1,2,4); (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) UC Berkeley; (3) University of Pennsylvania; (4) Shanghai Qi Zhi Institute; cwen20@mails.tsinghua.edu.cn, jerrylin0928@berkeley.edu, trevor@eecs.berkeley.edu, dineshj@seas.upenn.edu, gaoyangiiis@tsinghua.edu.cn
Pseudocode | Yes | "More details and pseudocode are in Appendix." (Section 5)
Open Source Code | No | The paper does not include a direct statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | "Motivated by robotic control, we use the six OpenAI Gym MuJoCo continuous control tasks." (Section 4) and "MuJoCo [48] control environments from OpenAI Gym" (Section 6.2)
Dataset Splits | No | The paper mentions training on 'expert demonstrations' and evaluating on 'held-out trajectories' (Table 1), but it does not provide specific percentages or counts for training, validation, or test splits.
Hardware Specification | No | The paper does not specify any hardware details such as CPU, GPU, or TPU models used for running the experiments.
Software Dependencies | No | The paper mentions using OpenAI Gym and MuJoCo environments, TRPO for expert generation, the Adam optimizer, and neural networks (MLPs), but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "We set the history size H = 2 and train a neural network policy π_θ(a_t | ō_t = [o_t, o_{t-1}]) on expert demonstrations." (Section 4) and "In our experiments, the encoder E is a 4-layer MLP, and the decoder F and the adversary D are each 2-layer MLPs." (Section 5) A rough sketch of this setup is given below the table.
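The experiment-setup row describes the architecture only at the level of layer counts and history size. Below is a minimal sketch of how those pieces could be wired together, assuming PyTorch and OpenAI Gym; the hidden width, learning rate, the choice of Hopper-v2 as a representative MuJoCo task, and feeding the adversary D the encoder's embedding are all assumptions not stated in the quoted excerpts, so this illustrates the stated shapes rather than the authors' implementation.

```python
# Hedged sketch of the stated setup: history size H = 2, a 4-layer MLP encoder E,
# 2-layer MLP decoder F and adversary D, Adam optimizer. Hidden width, learning rate,
# and the specific environment are assumptions.
import gym
import torch
import torch.nn as nn

env = gym.make("Hopper-v2")              # assumed: one of the six OpenAI Gym MuJoCo tasks
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
H = 2                                    # history size stated in the paper

def mlp(sizes):
    """Plain MLP with ReLU activations between consecutive linear layers."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

hidden = 256                                             # assumed width
E = mlp([H * obs_dim, hidden, hidden, hidden, hidden])   # encoder: 4-layer MLP
F = mlp([hidden, hidden, act_dim])                       # decoder: 2-layer MLP predicting a_t
D = mlp([hidden, hidden, act_dim])                       # adversary: 2-layer MLP on E's embedding (assumed wiring)

policy_opt = torch.optim.Adam(list(E.parameters()) + list(F.parameters()), lr=3e-4)  # lr assumed
adversary_opt = torch.optim.Adam(D.parameters(), lr=3e-4)

def policy(o_t, o_prev):
    """pi_theta(a_t | [o_t, o_{t-1}]) = F(E([o_t, o_{t-1}]))."""
    history = torch.cat([o_t, o_prev], dim=-1)
    return F(E(history))
```

The adversarial min-max training loop that pits E and F against D is only alluded to in the quoted text, so it is left out of this sketch.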