Learning to Weight Imperfect Demonstrations
Authors: Yunke Wang, Chang Xu, Bo Du, Honglak Lee
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in the MuJoCo and Atari environments demonstrate that the proposed algorithm outperforms baseline methods in handling imperfect expert demonstrations. |
| Researcher Affiliation | Collaboration | 1 National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 2 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 3 EECS Department, University of Michigan, USA; 4 LG AI Research, South Korea. |
| Pseudocode | No | The paper describes the methodology but does not include any explicit pseudocode blocks or clearly labeled algorithm sections. |
| Open Source Code | No | The paper cites a third-party implementation (Kostrikov's PPO) that was used, but does not provide an explicit statement or link to the source code for their proposed method (WGAIL). |
| Open Datasets | Yes | We first conduct experiments on four continuous control tasks in the MuJoCo simulator (Todorov et al., 2012): Ant-v2, Hopper-v2, Walker2d-v2, and HalfCheetah-v2. ... we only evaluate WGAIL on five Atari games BeamRider, Pong, Qbert, Seaquest and Hero with one kind of imperfect demonstrations. (A sketch of instantiating these environments follows the table.) |
| Dataset Splits | No | The paper discusses training and testing but does not explicitly mention or specify details about a validation dataset split. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper refers to 'Kostrikov’s implementation of PPO' and mentions PyTorch in a citation, but does not provide specific version numbers for software dependencies used in their experiments. |
| Experiment Setup | No | The paper mentions evaluating with 'five different random seeds' and using the 'default hyperparameter' settings from a third-party PPO implementation, but does not provide specific values for hyperparameters or detailed training configurations (e.g., learning rate, batch size, number of epochs) for its own experiments. (A sketch of the implied seeding protocol follows the table.) |
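
The environments listed under Open Datasets map onto standard Gym IDs. Below is a minimal sketch, assuming OpenAI Gym with the MuJoCo (`-v2`) and Atari extras installed, of how those environments could be instantiated and rolled out with a random policy. The Atari IDs (e.g. `BeamRiderNoFrameskip-v4`) and the `rollout_random` helper are assumptions for illustration; the paper only names the games and tasks.

```python
# Minimal sketch: instantiate the benchmark environments named in the paper
# and roll out a random policy. Uses the classic Gym API (reset -> obs,
# step -> (obs, reward, done, info)), which matches the -v2 MuJoCo tasks.
import gym

MUJOCO_TASKS = ["Ant-v2", "Hopper-v2", "Walker2d-v2", "HalfCheetah-v2"]
ATARI_GAMES = [  # assumed NoFrameskip-v4 IDs; the paper only gives game names
    "BeamRiderNoFrameskip-v4", "PongNoFrameskip-v4", "QbertNoFrameskip-v4",
    "SeaquestNoFrameskip-v4", "HeroNoFrameskip-v4",
]

def rollout_random(env_id: str, episodes: int = 1) -> float:
    """Run a random policy for a few episodes and return the mean return."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(env.action_space.sample())
            total += reward
        returns.append(total)
    env.close()
    return sum(returns) / len(returns)

if __name__ == "__main__":
    for env_id in MUJOCO_TASKS:
        print(env_id, rollout_random(env_id))
```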
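
On the experiment-setup point, the only stated reproducibility control is evaluation over five random seeds. The following is a minimal sketch, not the authors' code, of one common way to implement such a protocol in PyTorch; the seed values and the `train_and_evaluate` callback are hypothetical, since the paper does not report its seed list or hyperparameters.

```python
# Minimal sketch: repeat training/evaluation under several seeds and report
# mean and standard deviation, as implied by the paper's five-seed evaluation.
import random
import numpy as np
import torch

def set_global_seed(seed: int) -> None:
    """Seed Python, NumPy and PyTorch so a single run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def evaluate_over_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the given training/evaluation callback once per seed."""
    scores = []
    for seed in seeds:
        set_global_seed(seed)
        scores.append(train_and_evaluate(seed))  # hypothetical callback
    scores = np.asarray(scores, dtype=np.float64)
    return scores.mean(), scores.std()
```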