Imitation Learning from Observation with Automatic Discount Scheduling
Authors: Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them. |
| Researcher Affiliation | Academia | Yuyang Liu1,2, Weijun Dong1,2, Yingdong Hu1,2, Chuan Wen1,2, Zhao-Heng Yin3, Chongjie Zhang4, Yang Gao1,2,5. 1Institute for Interdisciplinary Information Sciences, Tsinghua University; 2Shanghai Qi Zhi Institute; 3UC Berkeley; 4Washington University in St. Louis; 5Shanghai Artificial Intelligence Laboratory |
| Pseudocode | Yes | Algorithm 1: Imitation Learning from Observation with Automatic Discount Scheduling. (A hedged Python sketch of the scheduling loop appears after this table.) |
| Open Source Code | Yes | Our code is available at https://il-ads.github.io/. With the code released online and the hyperparameter settings in Appendix A.1, the experiment results are highly reproducible. |
| Open Datasets | Yes | We experiment with 9 challenging tasks from the Meta-World (Yu et al., 2020) suite. The agent is equipped with 10 expert demonstration sequences, which solely comprise observational data. |
| Dataset Splits | No | The paper describes reinforcement learning experiments with agents interacting in an environment, and does not specify traditional training/validation/test dataset splits like those found in supervised learning tasks. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory details) used to run its experiments. |
| Software Dependencies | No | The paper mentions software like DrQ-v2, the Adam optimizer, and ResNet-50 but does not provide specific version numbers for these components, which are required for reproducible software dependencies. |
| Experiment Setup | Yes | The hyperparameters are listed in Table 1: replay buffer capacity 150,000; n-step returns 3; mini-batch size 512; discount γ (for baselines) 0.99; optimizer Adam; learning rate 10⁻⁴; critic Q-function soft-update rate τ 0.005; hidden dimension 1024; exploration noise N(0, 0.4); policy noise clip(N(0, 0.1), −0.3, 0.3); delayed policy update 1; λ (for progress recognizer Φ) 0.9; α (for mapping function fγ) 0.2. (A config sketch collecting these values appears after this table.) |
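
The table above quotes Algorithm 1 only by name, so as a reading aid here is a minimal Python sketch of what automatic discount scheduling could look like. Only λ = 0.9 (for the progress recognizer Φ) and α = 0.2 (for the mapping function fγ) come from the paper's hyperparameter table; the EMA smoothing, the horizon-interpolation form of `f_gamma`, and all names below are our assumptions, not the authors' released implementation.

```python
# Hedged sketch of Automatic Discount Scheduling (ADS).
# Assumptions: progress is recognized per episode in [0, 1], smoothed
# with an EMA (lambda = 0.9), and mapped to a discount by a monotone
# hypothetical f_gamma parameterized by alpha = 0.2.

class DiscountScheduler:
    def __init__(self, lam: float = 0.9, alpha: float = 0.2,
                 gamma_max: float = 0.99):
        self.lam = lam                          # EMA smoothing (Table 1: lambda)
        self.alpha = alpha                      # mapping shape (Table 1: alpha)
        self.h_max = 1.0 / (1.0 - gamma_max)    # effective horizon at gamma_max
        self.progress = 0.0                     # smoothed progress estimate

    def update(self, raw_progress: float) -> float:
        """Fold one episode's recognized progress (from Φ, in [0, 1])
        into the EMA estimate and return the scheduled discount."""
        self.progress = self.lam * self.progress + (1.0 - self.lam) * raw_progress
        return self.f_gamma(self.progress)

    def f_gamma(self, p: float) -> float:
        # Hypothetical monotone mapping: grow the effective horizon
        # 1/(1 - gamma) from alpha * h_max up to h_max as p goes 0 -> 1,
        # so the agent is myopic early in training and far-sighted once
        # it has mastered the earlier parts of the demonstration.
        horizon = (self.alpha + (1.0 - self.alpha) * p) * self.h_max
        return 1.0 - 1.0 / max(horizon, 1.0)


scheduler = DiscountScheduler()
for episode_progress in (0.0, 0.1, 0.4, 0.9):   # e.g. estimates from Φ
    gamma = scheduler.update(episode_progress)
    # ... run RL updates (e.g. DrQ-v2) with this scheduled discount ...
```

The design intuition matches the paper's stated motivation: later behaviors depend on mastering earlier ones, so the discount starts small and grows with recognized progress; the exact fγ should be taken from the released code at https://il-ads.github.io/.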
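
Similarly, the Table 1 values quoted in the Experiment Setup row can be collected into one configuration mapping. This is only a restatement of the reported numbers; the key names are our own, and the symmetric −0.3/0.3 policy-noise clip bounds are the range reconstructed above and should be checked against the released code.

```python
# Hyperparameters as reported in Table 1 of the paper.
# Key names are ours; values are quoted from the table.
HYPERPARAMS = {
    "replay_buffer_capacity": 150_000,
    "n_step_returns": 3,
    "mini_batch_size": 512,
    "discount_gamma_baselines": 0.99,    # fixed gamma; ADS schedules it instead
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "critic_soft_update_tau": 0.005,
    "hidden_dim": 1024,
    "exploration_noise_std": 0.4,        # N(0, 0.4)
    "policy_noise_std": 0.1,             # N(0, 0.1), clipped
    "policy_noise_clip": (-0.3, 0.3),    # assumed symmetric clip bounds
    "delayed_policy_update": 1,
    "lambda_progress_recognizer": 0.9,   # lambda for Phi
    "alpha_mapping_fn": 0.2,             # alpha for f_gamma
}
```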