Learning to Weight Imperfect Demonstrations

Authors: Yunke Wang, Chang Xu, Bo Du, Honglak Lee

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in the MuJoCo and Atari environments demonstrate that the proposed algorithm outperforms baseline methods in handling imperfect expert demonstrations.
Researcher Affiliation | Collaboration | 1. National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 2. School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 3. EECS Department, University of Michigan, USA; 4. LG AI Research, South Korea.
Pseudocode | No | The paper describes the methodology but does not include any explicit pseudocode blocks or clearly labeled algorithm sections.
Open Source Code | No | The paper cites a third-party implementation (Kostrikov's PPO) that was used, but does not provide an explicit statement or link to source code for its proposed method (WGAIL).
Open Datasets | Yes | "We first conduct experiments on four continuous control tasks in the Mujoco simulator (Todorov et al., 2012): Ant-v2, Hopper-v2, Walker2d-v2, and HalfCheetah-v2. ... we only evaluate WGAIL on five Atari games Beamrider, Pong, Qbert, Seaquest and Hero with one kind of imperfect demonstrations."
Dataset Splits | No | The paper discusses training and testing but does not explicitly mention or specify a validation split.
Hardware Specification | No | The paper does not provide specific details on the hardware used for the experiments, such as CPU or GPU models or memory specifications.
Software Dependencies | No | The paper refers to Kostrikov's implementation of PPO and cites PyTorch, but does not provide specific version numbers for the software dependencies used in its experiments.
Experiment Setup | No | The paper mentions evaluating with five different random seeds and using the default hyperparameters of a third-party PPO implementation, but does not provide specific hyperparameter values or detailed training configurations (e.g., learning rate, batch size, number of epochs) for its own experiments.
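The one evaluation detail the paper does state — reporting results averaged over five random seeds — can be sketched in a few lines. This is a minimal illustration, not the authors' code (none is released): `run_policy` is a hypothetical callable standing in for a trained WGAIL policy rollout, and the dummy policy below exists only to exercise the helper.

```python
import random
import statistics

def evaluate_over_seeds(run_policy, env_name, seeds=(0, 1, 2, 3, 4)):
    """Aggregate episodic returns across several random seeds.

    `run_policy(env_name, seed) -> float` is a hypothetical stand-in for
    rolling out a trained policy in one environment; the paper reports
    results averaged over five seeds but provides no implementation.
    """
    returns = [run_policy(env_name, seed) for seed in seeds]
    return statistics.mean(returns), statistics.stdev(returns)

# Toy stand-in policy: a seeded noisy constant return, so the helper
# can be run without MuJoCo or a trained model.
def dummy_policy(env_name, seed):
    rng = random.Random(seed)
    return 1000.0 + rng.uniform(-50.0, 50.0)

mean_ret, std_ret = evaluate_over_seeds(dummy_policy, "Hopper-v2")
```

Because each per-seed return is seeded, the aggregate is deterministic; in a real reproduction the same pattern would wrap environment resets and policy rollouts with the five seed values.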