Imitation Learning from Imperfection: Theoretical Justifications and Algorithms
Authors: Ziniu Li, Tian Xu, Zeyu Qin, Yang Yu, Zhi-Quan Luo
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies demonstrate that our method outperforms previous state-of-the-art methods in tasks including robotic locomotion control, Atari video games, and image classification. |
| Researcher Affiliation | Collaboration | Ziniu Li (1,2), Tian Xu (3,4), Zeyu Qin (5), Yang Yu (3,4), and Zhi-Quan Luo (1,2). 1: The Chinese University of Hong Kong, Shenzhen; 2: Shenzhen Research Institute of Big Data; 3: National Key Laboratory for Novel Software Technology, Nanjing University; 4: Polixir.ai; 5: Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 ISW-BC (a hedged sketch of the weighted behavior-cloning update appears after this table) |
| Open Source Code | Yes | The code is available at https://github.com/liziniu/ISWBC. |
| Open Datasets | Yes | We use the replay buffer data from an online DQN agent, which is publicly available at https://console.cloud.google.com/storage/browser/atari-replay-datasets, thanks to the work of [2]. We use a famous dataset, DomainNet [36]. |
| Dataset Splits | No | The paper specifies a train/test split for image classification ("80% for training and 20% for testing") but does not explicitly mention or detail a separate validation set split across its experiments. |
| Hardware Specification | Yes | The experiments are conducted on a machine comprising 48 CPU cores and 4 V100 GPUs. |
| Software Dependencies | No | The paper mentions software like "rlkit codebase", "Adam optimizer", "ResNet-18 model", and "CVXPY" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use a 2-hidden-layer multi-layer perceptron (MLP) with hidden size 256 and ReLU activation for all algorithms... We use a batch size of 256 and Adam optimizer with a learning rate of 0.0003 for training both networks. The training process is carried out for 1 million iterations. We set δ to 0 and use a gradient penalty coefficient of 8 by default. (A hedged training skeleton wiring these hyperparameters together follows the table.) |
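
To make the Pseudocode row concrete, here is a minimal PyTorch sketch of the importance-sampling-weighted behavior-cloning (ISW-BC) update named in Algorithm 1. The network shape follows the reported setup (2 hidden layers of width 256, ReLU); the discriminator-based weight formula, the squared-error surrogate for the BC objective, and all tensor shapes are assumptions made for illustration, not the authors' implementation (see https://github.com/liziniu/ISWBC for that).

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """2-hidden-layer MLP, hidden size 256, ReLU, as reported in the setup."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def isw_bc_step(policy, discriminator, pol_opt, union_batch, delta=0.0):
    """One policy update: behavior cloning on the union of expert and
    supplementary data, reweighted by an assumed density-ratio estimate
    c / (1 - c) from a discriminator c(s, a) trained to output 1 on expert
    pairs and 0 on union pairs."""
    s, a = union_batch  # states and continuous actions from the union buffer
    with torch.no_grad():
        c = torch.sigmoid(discriminator(torch.cat([s, a], dim=-1)))
        w = c / (1.0 - c).clamp(min=1e-6)          # assumed importance weight
        w = torch.where(c >= delta, w, torch.zeros_like(w))  # δ = 0 keeps all pairs
    # Squared error stands in for the paper's log-likelihood BC objective.
    loss = (w.squeeze(-1) * ((policy(s) - a) ** 2).sum(dim=-1)).mean()
    pol_opt.zero_grad()
    loss.backward()
    pol_opt.step()
    return loss.item()
```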
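To ground the Experiment Setup row, the hedged skeleton below wires the reported hyperparameters together: batch size 256, Adam with learning rate 0.0003, 1 million iterations, δ = 0, and a gradient-penalty coefficient of 8 on the discriminator. `sample`, `expert_buffer`, `union_buffer`, `obs_dim`, and `act_dim` are hypothetical placeholders; the WGAN-GP-style interpolated gradient penalty is likewise an assumption about how the reported coefficient is applied.

```python
import torch

def gradient_penalty(disc, expert_sa, union_sa, coef=8.0):
    """Gradient penalty on inputs interpolated between expert and union
    samples (an assumed WGAN-GP-style scheme); coef=8 matches the report."""
    eps = torch.rand(expert_sa.size(0), 1, device=expert_sa.device)
    mid = (eps * expert_sa + (1 - eps) * union_sa).requires_grad_(True)
    grad = torch.autograd.grad(disc(mid).sum(), mid, create_graph=True)[0]
    return coef * ((grad.norm(dim=-1) - 1.0) ** 2).mean()

policy = MLP(obs_dim, act_dim)                 # MLP from the sketch above
discriminator = MLP(obs_dim + act_dim, 1)
pol_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = torch.nn.BCEWithLogitsLoss()

for it in range(1_000_000):
    se, ae = sample(expert_buffer, batch_size=256)  # hypothetical sampler
    su, au = sample(union_buffer, batch_size=256)
    expert_sa = torch.cat([se, ae], dim=-1)
    union_sa = torch.cat([su, au], dim=-1)
    # Discriminator step: label expert pairs 1, union pairs 0, add the penalty.
    logit_e, logit_u = discriminator(expert_sa), discriminator(union_sa)
    disc_loss = (bce(logit_e, torch.ones_like(logit_e))
                 + bce(logit_u, torch.zeros_like(logit_u))
                 + gradient_penalty(discriminator, expert_sa, union_sa))
    disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()
    # Policy step: importance-weighted behavior cloning on the union data.
    isw_bc_step(policy, discriminator, pol_opt, (su, au), delta=0.0)
```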