Variational Imitation Learning with Diverse-quality Demonstrations

Authors: Voot Tangkaratt, Bo Han, Mohammad Emtiyaz Khan, Masashi Sugiyama

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method is easy to implement within reinforcement-learning frameworks and also achieves state-of-the-art performance on continuous-control benchmarks. Our work enables scalable and data-efficient imitation learning under more realistic settings than before. We experimentally evaluate VILD (with IS and without IS) in continuous-control tasks. Performance is evaluated using the cumulative ground-truth reward along trajectories collected by policies (Ho & Ermon, 2016). We report the mean and standard error computed over 5 trials. (See the evaluation sketch below the table.)
Researcher Affiliation | Academia | (1) RIKEN Center for Advanced Intelligence Project, Japan; (2) Department of Computer Science, Hong Kong Baptist University, Hong Kong; (3) Department of Complexity Science and Engineering, The University of Tokyo, Japan.
Pseudocode | Yes | Algorithm 1 shows the pseudo-code of VILD.
Open Source Code | Yes | Source code: www.github.com/voot-t/vild_code
Open Datasets | Yes | We consider a Lunar Lander task, where an optimal policy is available for generating high-quality demonstrations (Brockman et al., 2016). Lastly, we evaluate the robustness of VILD against real-world demonstrations collected by crowdsourcing (Mandlekar et al., 2018). While the public datasets were collected for Assembly tasks in the Robosuite platform (Fan et al., 2018), we consider a Reacher task, where demonstrations in Assembly tasks are clipped when the robot's end-effector contacts the object. We use a Reacher dataset with approximately 5000 state-action pairs.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, memory) to run its experiments.
Software Dependencies | No | We use TRPO (Schulman et al., 2015) as an RL method, except on the Humanoid task where we use SAC (Haarnoja et al., 2018) since TRPO does not perform well. We use PPO (Schulman et al., 2017) as an RL method. VILD solves Eq. (13) to learn the policy q_θ(a_t | s_t), where θ is optimized by RL with reward r_φ, while φ, ω, and ψ are optimized by stochastic gradient methods such as Adam (Kingma & Ba, 2015). The paper names software components (TRPO, SAC, PPO, Adam) but does not specify their versions. (See the training-loop sketch below the table.)
Experiment Setup | Yes | Similarly to prior works (Ho & Ermon, 2016), we implement VILD using feed-forward neural networks with two hidden layers and use Monte-Carlo estimation to approximate expectations. We also pre-train the Gaussian mean of q_ψ to obtain reasonable initial predictions; we perform least-squares regression for 1000 gradient steps with target value u_t. More implementation details are given in Appendix C.3. (See the pre-training sketch below the table.)
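
The Software Dependencies row quotes the optimization split used by VILD: the policy parameters θ are improved by an off-the-shelf RL method (TRPO, PPO, or SAC in the paper) against the learned reward r_φ, while φ, ω, and ψ are updated with Adam. The following is a minimal sketch of that alternating loop, assuming PyTorch; the component objects and the callables `collect_trajectories`, `rl_policy_update`, `vild_objective`, and `sample_demonstrations` are hypothetical stand-ins, not the authors' API, and the actual objective is Eq. (13) of the paper.

```python
# Skeleton of the optimization split: theta is updated by an RL step against the learned
# reward r_phi, while phi (reward), omega (noise model), and psi (variational posterior)
# are updated jointly with Adam. All components passed in are hypothetical stand-ins.
import itertools
from torch.optim import Adam

def train_vild_sketch(policy, reward_net, noise_model, posterior,
                      collect_trajectories, rl_policy_update,
                      vild_objective, sample_demonstrations,
                      num_iterations=1000, lr=3e-4):
    # One Adam optimizer jointly over phi, omega, psi.
    opt = Adam(itertools.chain(reward_net.parameters(),
                               noise_model.parameters(),
                               posterior.parameters()), lr=lr)
    for _ in range(num_iterations):
        # (1) Policy step: improve theta with an RL method (TRPO/PPO/SAC in the paper),
        #     using the learned reward r_phi in place of the environment reward.
        trajectories = collect_trajectories(policy, reward_fn=reward_net)
        rl_policy_update(policy, trajectories)

        # (2) Reward/noise/posterior step: one stochastic-gradient (Adam) update of
        #     phi, omega, psi on a Monte-Carlo estimate of the variational objective.
        batch = sample_demonstrations()
        loss = vild_objective(reward_net, noise_model, posterior, policy, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy, reward_net, noise_model, posterior
```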
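The Experiment Setup row quotes a pre-training step: least-squares regression of the Gaussian mean of q_ψ toward target values u_t for 1000 gradient steps. The excerpt does not specify how u_t is constructed or what the mean network conditions on, so the sketch below treats the targets as given and assumes the mean is predicted from the state and the demonstrated action; the two-hidden-layer architecture mirrors the quoted setup, and all names and sizes are illustrative.

```python
# Sketch of the quoted pre-training: MSE (least-squares) regression of the Gaussian mean
# of q_psi onto target values u_t for 1000 gradient steps. Inputs and sizes are assumptions.
import torch
import torch.nn as nn
from torch.optim import Adam

class MeanNet(nn.Module):
    """Two-hidden-layer MLP predicting the Gaussian mean of q_psi."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, states, demo_actions):
        return self.net(torch.cat([states, demo_actions], dim=-1))

def pretrain_mean(mean_net, states, demo_actions, targets_u, steps=1000, lr=3e-4):
    """Least-squares regression of the predicted mean onto the targets u_t."""
    opt = Adam(mean_net.parameters(), lr=lr)
    for _ in range(steps):
        pred = mean_net(states, demo_actions)
        loss = ((pred - targets_u) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mean_net
```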
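Finally, the evaluation protocol quoted in the Research Type row (cumulative ground-truth reward along collected trajectories, reported as mean and standard error over 5 trials) reduces to a short computation. A minimal sketch, assuming each trial yields a list of per-trajectory reward sequences; function names are illustrative.

```python
# Sketch of the quoted evaluation: mean and standard error of cumulative ground-truth
# reward across 5 independent trials (NumPy only; the data layout is an assumption).
import numpy as np

def trial_return(trajectory_rewards):
    """Average over trajectories of the per-trajectory cumulative ground-truth reward."""
    return np.mean([np.sum(r) for r in trajectory_rewards])

def summarize(per_trial_returns):
    """Mean and standard error across trials (5 trials in the paper)."""
    x = np.asarray(per_trial_returns, dtype=float)
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

# Usage: summarize([trial_return(rewards_k) for rewards_k in all_trials])
# where all_trials holds the reward sequences collected in each of the 5 trials.
```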