Imitation Learning from Observation with Automatic Discount Scheduling

Authors: Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
Researcher Affiliation | Academia | Yuyang Liu (1,2), Weijun Dong (1,2), Yingdong Hu (1,2), Chuan Wen (1,2), Zhao-Heng Yin (3), Chongjie Zhang (4), Yang Gao (1,2,5). Affiliations: 1 Institute for Interdisciplinary Information Sciences, Tsinghua University; 2 Shanghai Qi Zhi Institute; 3 UC Berkeley; 4 Washington University in St. Louis; 5 Shanghai Artificial Intelligence Laboratory
Pseudocode | Yes | Algorithm 1: Imitation Learning from Observation with Automatic Discount Scheduling (a hedged sketch of the scheduling idea is given after the table)
Open Source Code | Yes | Our code is available at https://il-ads.github.io/. With the code released online and the hyperparameter settings in Appendix A.1, the experiment results are highly reproducible.
Open Datasets | Yes | We experiment with 9 challenging tasks from the Meta-World (Yu et al., 2020) suite. Instead, the agent is equipped with 10 expert demonstration sequences, which solely comprise observational data.
Dataset Splits | No | The paper describes reinforcement learning experiments with agents interacting in an environment and does not specify traditional training/validation/test dataset splits like those found in supervised learning tasks.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory details) used to run its experiments.
Software Dependencies | No | The paper mentions software such as DrQ-v2, the Adam optimizer, and ResNet-50, but does not provide specific version numbers for these components, which are required for reproducible software dependencies.
Experiment Setup | Yes | The hyperparameters are listed in Table 1: replay buffer capacity 150000; n-step returns 3; mini-batch size 512; discount γ (for baselines) 0.99; optimizer Adam; learning rate 1e-4; critic Q-function soft-update rate τ 0.005; hidden dimension 1024; exploration noise N(0, 0.4); policy noise clip(N(0, 0.1), -0.3, 0.3); delayed policy update 1; λ (for progress recognizer Φ) 0.9; α (for mapping function f_γ) 0.2. (These values are gathered into a configuration sketch below.)
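
The paper's Algorithm 1 is not reproduced here. The following is a minimal Python sketch of the automatic discount scheduling idea only, assuming an EMA-style progress recognizer Φ and a simple power-law mapping f_γ from recognized progress to a discount factor. The class and function names, the use of a matched-demonstration fraction as raw progress, and the discount range are illustrative assumptions, not the paper's exact definitions; only λ = 0.9 and α = 0.2 come from Table 1.

```python
import numpy as np

# Hedged sketch of Automatic Discount Scheduling (ADS).
# Assumptions (not taken from the paper or its released code):
#   - raw progress is the fraction of the demonstration that the agent's
#     recent episodes can be matched to,
#   - the recognizer Phi smooths raw progress with an exponential moving
#     average controlled by lam (lambda = 0.9 in Table 1),
#   - f_gamma maps smoothed progress in [0, 1] to a discount in
#     [gamma_min, gamma_max]; the exponent alpha (= 0.2 in Table 1) shapes
#     how quickly the effective horizon grows as early subtasks are mastered.

class ProgressRecognizer:
    """EMA-smoothed estimate of how far into the demonstration the agent gets."""

    def __init__(self, lam: float = 0.9):
        self.lam = lam
        self.progress = 0.0  # smoothed progress in [0, 1]

    def update(self, raw_progress: float) -> float:
        # raw_progress: fraction of demo frames matched in the latest episodes
        self.progress = self.lam * self.progress + (1.0 - self.lam) * raw_progress
        return self.progress


def f_gamma(progress: float, alpha: float = 0.2,
            gamma_min: float = 0.9, gamma_max: float = 0.99) -> float:
    """Map recognized progress to a discount factor (illustrative form)."""
    shaped = progress ** alpha          # alpha < 1 front-loads horizon growth
    return gamma_min + shaped * (gamma_max - gamma_min)


# Usage inside a training loop (schematic):
recognizer = ProgressRecognizer(lam=0.9)
for epoch in range(3):
    raw = np.random.uniform(0.0, 1.0)   # stand-in for the matched-demo fraction
    gamma = f_gamma(recognizer.update(raw))
    # ... run the underlying ILfO/RL update (e.g. DrQ-v2) with this gamma ...
    print(f"epoch {epoch}: discount = {gamma:.3f}")
```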
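For reference, the Table 1 values can be collected into a single configuration object. The field names below are illustrative and not taken from the released code, and the symmetric policy-noise clip range is an assumption recovered from the garbled original entry.

```python
from dataclasses import dataclass

# Table 1 hyperparameters gathered into one configuration object.
# Field names are illustrative; the released code may organize them differently.

@dataclass
class ADSConfig:
    replay_buffer_capacity: int = 150_000
    n_step_returns: int = 3
    mini_batch_size: int = 512
    discount_baselines: float = 0.99       # fixed gamma used by the baselines
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    critic_soft_update_tau: float = 0.005
    hidden_dim: int = 1024
    exploration_noise_std: float = 0.4     # N(0, 0.4)
    policy_noise_std: float = 0.1          # N(0, 0.1), clipped (assumed symmetric)
    policy_noise_clip: float = 0.3         # clip range [-0.3, 0.3]
    delayed_policy_update: int = 1
    progress_recognizer_lambda: float = 0.9   # lambda for Phi
    mapping_alpha: float = 0.2                # alpha for f_gamma

config = ADSConfig()
print(config)
```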