Self-Adaptive Imitation Learning: Learning Tasks with Delayed Rewards from Sub-optimal Demonstrations
Authors: Zhuangdi Zhu, Kaixiang Lin, Bo Dai, Jiayu Zhou
AAAI 2022, pp. 9269-9277 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical results show that not only does SAIL significantly improve the sample efficiency, but it also leads to higher asymptotic performance across different continuous control tasks, compared with the state-of-the-art. In this section, we study how SAIL achieves the objective of imitation learning and exploration in an environment with delayed rewards. Extensive experiments have been conducted to answer the following key questions: |
| Researcher Affiliation | Collaboration | Zhuangdi Zhu (1), Kaixiang Lin (1), Bo Dai (2), Jiayu Zhou (1); affiliations: 1 Michigan State University, 2 Google Brain |
| Pseudocode | Yes | Algorithm 1: Self-Adaptive Imitation Learning |
| Open Source Code | No | The paper does not state that the SAIL implementation is open-sourced and does not link to a code repository. It only references a third-party library: "we built SAIL on a TD3 framework (Fujimoto, Van Hoof, and Meger 2018) based on stable-baselines implementations," where footnote 1 points to https://stable-baselines.readthedocs.io/en/master/. |
| Open Datasets | Yes | It is tested on four popular MuJoCo tasks (footnote 2): Walker2d-v2, Hopper-v2, HalfCheetah-v2, and Swimmer-v2. For each task, teacher demonstrations are generated from a deterministic policy that was pre-trained to be sub-optimal. |
| Dataset Splits | No | The paper mentions that "All experiments are conducted using one imperfect demonstration trajectory on five random seeds" and "Models are evaluated after training using 10^6 interaction samples." However, it does not specify explicit train/validation/test splits (by percentage or count) as would be expected in a supervised-learning setup; in RL, evaluation is instead performed by rolling out the learned policy in the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper states: "we built SAIL on a TD3 framework (Fujimoto, Van Hoof, and Meger 2018) based on stable-baselines implementations" (footnote 1). While it names stable-baselines, it does not give a version number for that library or for any other dependency (Python, PyTorch/TensorFlow, CUDA) needed for full reproducibility. A hedged sketch of such a TD3/stable-baselines setup appears after the table. |
| Experiment Setup | No | The paper mentions: "All experiments are conducted using one imperfect demonstration trajectory on five random seeds, with each trajectory containing no more than 1000 transitions. Models are evaluated after training using 10^6 interaction samples." This fixes the number of seeds, the demonstration budget, and the interaction budget, but it omits key hyperparameters such as learning rates, batch sizes, optimizer settings, and network architectures, so the setup is not fully reproducible. A hedged sketch of the demonstration-collection step also appears after the table. |
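
Because the paper only names TD3 and stable-baselines without versions or hyperparameters, the following is a minimal, hypothetical sketch of how such a TD3 baseline could be set up on one of the reported MuJoCo tasks under the quoted protocol (10^6 interaction samples). It is not the authors' code; the choice of environment, seed, and save path are illustrative assumptions.

```python
# Hypothetical sketch, not the authors' implementation: a TD3 agent from
# stable-baselines trained on one of the four reported MuJoCo tasks for
# 10^6 environment interactions, as quoted in the Experiment Setup row.
import gym
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy

env = gym.make("Walker2d-v2")            # one of the four reported tasks (assumed choice)
model = TD3(MlpPolicy, env, seed=0, verbose=1)   # seed 0 stands in for one of the five seeds
model.learn(total_timesteps=int(1e6))    # matches the reported 10^6 interaction samples
model.save("td3_walker2d_seed0")         # illustrative output path
```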
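
The quoted setup uses a single imperfect demonstration trajectory of at most 1000 transitions, rolled out from a pre-trained, deliberately sub-optimal deterministic policy. The sketch below shows one plausible way to collect such a trajectory; `teacher_policy` is an assumed stand-in for that pre-trained policy, and the old 4-tuple gym step API is assumed to match the v2 MuJoCo tasks.

```python
# Hypothetical sketch of the demonstration-collection step described in the paper:
# roll out a pre-trained, sub-optimal deterministic policy for a single trajectory
# of at most 1000 transitions. `teacher_policy` is an assumed callable, not an
# artifact released with the paper.
import gym

def collect_demo(env_id, teacher_policy, max_len=1000, seed=0):
    env = gym.make(env_id)
    env.seed(seed)
    obs, done, demo = env.reset(), False, []
    while not done and len(demo) < max_len:
        action = teacher_policy(obs)                    # deterministic teacher action
        next_obs, reward, done, _ = env.step(action)    # old gym 4-tuple API assumed
        demo.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return demo

# Example: one imperfect trajectory per task, reused across the five random seeds.
# demo = collect_demo("Hopper-v2", teacher_policy)
```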