Variational Imitation Learning with Diverse-quality Demonstrations

Authors: Voot Tangkaratt, Bo Han, Mohammad Emtiyaz Khan, Masashi Sugiyama

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method is easy to implement within reinforcement-learning frameworks and also achieves state-of-the-art performance on continuous-control benchmarks. Our work enables scalable and data-efficient imitation learning under more realistic settings than before. We experimentally evaluate VILD (with IS and without IS) in continuous-control tasks. Performance is evaluated using the cumulative ground-truth reward along trajectories collected by policies (Ho & Ermon, 2016). We report the mean and standard error computed over 5 trials. (See the evaluation sketch below the table.)
Researcher Affiliation | Academia | (1) RIKEN Center for Advanced Intelligence Project, Japan; (2) Department of Computer Science, Hong Kong Baptist University, Hong Kong; (3) Department of Complexity Science and Engineering, The University of Tokyo, Japan.
Pseudocode | Yes | Algorithm 1 shows the pseudo-code of VILD.
Open Source Code | Yes | Source code: www.github.com/voot-t/vild_code
Open Datasets | Yes | We consider a Lunar Lander task, where an optimal policy is available for generating high-quality demonstrations (Brockman et al., 2016). Lastly, we evaluate the robustness of VILD against real-world demonstrations collected by crowdsourcing (Mandlekar et al., 2018). While the public datasets were collected for Assembly tasks in the Robosuite platform (Fan et al., 2018), we consider a Reacher task, where demonstrations in Assembly tasks are clipped when the robot's end-effector contacts the object. We use a Reacher dataset with approximately 5000 state-action pairs.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, memory) to run its experiments.
Software Dependencies | No | We use TRPO (Schulman et al., 2015) as an RL method, except on the Humanoid task where we use SAC (Haarnoja et al., 2018) since TRPO does not perform well. We use PPO (Schulman et al., 2017) as an RL method. VILD solves Eq. (13) to learn the policy q_θ(a_t | s_t), where θ is optimized by RL with reward r_φ, while φ, ω, and ψ are optimized by stochastic gradient methods such as Adam (Kingma & Ba, 2015). The paper names software components (TRPO, SAC, PPO, Adam) but does not specify their versions. (See the training-loop sketch below the table.)
Experiment Setup | Yes | Similarly to prior works (Ho & Ermon, 2016), we implement VILD using feed-forward neural networks with two hidden layers and use Monte-Carlo estimation to approximate expectations. We also pre-train the Gaussian mean of q_ψ to obtain reasonable initial predictions; we perform least-squares regression for 1000 gradient steps with target value u_t. More implementation details are given in Appendix C.3. (See the pre-training sketch below the table.)
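
The Software Dependencies row quotes the optimization split used by VILD: the policy parameters θ are improved by an off-the-shelf RL method (TRPO, PPO, or SAC in the paper) against the learned reward r_φ, while φ, ω, and ψ are updated with Adam. The following is a minimal sketch of that alternating loop, assuming PyTorch; the component objects and the callables `collect_trajectories`, `rl_policy_update`, `vild_objective`, and `sample_demonstrations` are hypothetical stand-ins, not the authors' API, and the actual objective is Eq. (13) of the paper.

```python
# Skeleton of the optimization split: theta is updated by an RL step against the learned
# reward r_phi, while phi (reward), omega (noise model), and psi (variational posterior)
# are updated jointly with Adam. All components passed in are hypothetical stand-ins.
import itertools
from torch.optim import Adam

def train_vild_sketch(policy, reward_net, noise_model, posterior,
                      collect_trajectories, rl_policy_update,
                      vild_objective, sample_demonstrations,
                      num_iterations=1000, lr=3e-4):
    # One Adam optimizer jointly over phi, omega, psi.
    opt = Adam(itertools.chain(reward_net.parameters(),
                               noise_model.parameters(),
                               posterior.parameters()), lr=lr)
    for _ in range(num_iterations):
        # (1) Policy step: improve theta with an RL method (TRPO/PPO/SAC in the paper),
        #     using the learned reward r_phi in place of the environment reward.
        trajectories = collect_trajectories(policy, reward_fn=reward_net)
        rl_policy_update(policy, trajectories)

        # (2) Reward/noise/posterior step: one stochastic-gradient (Adam) update of
        #     phi, omega, psi on a Monte-Carlo estimate of the variational objective.
        batch = sample_demonstrations()
        loss = vild_objective(reward_net, noise_model, posterior, policy, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy, reward_net, noise_model, posterior
```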
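The Experiment Setup row quotes a pre-training step: least-squares regression of the Gaussian mean of q_ψ toward target values u_t for 1000 gradient steps. The excerpt does not specify how u_t is constructed or what the mean network conditions on, so the sketch below treats the targets as given and assumes the mean is predicted from the state and the demonstrated action; the two-hidden-layer architecture mirrors the quoted setup, and all names and sizes are illustrative.

```python
# Sketch of the quoted pre-training: MSE (least-squares) regression of the Gaussian mean
# of q_psi onto target values u_t for 1000 gradient steps. Inputs and sizes are assumptions.
import torch
import torch.nn as nn
from torch.optim import Adam

class MeanNet(nn.Module):
    """Two-hidden-layer MLP predicting the Gaussian mean of q_psi."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, states, demo_actions):
        return self.net(torch.cat([states, demo_actions], dim=-1))

def pretrain_mean(mean_net, states, demo_actions, targets_u, steps=1000, lr=3e-4):
    """Least-squares regression of the predicted mean onto the targets u_t."""
    opt = Adam(mean_net.parameters(), lr=lr)
    for _ in range(steps):
        pred = mean_net(states, demo_actions)
        loss = ((pred - targets_u) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mean_net
```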
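Finally, the evaluation protocol quoted in the Research Type row (cumulative ground-truth reward along collected trajectories, reported as mean and standard error over 5 trials) reduces to a short computation. A minimal sketch, assuming each trial yields a list of per-trajectory reward sequences; function names are illustrative.

```python
# Sketch of the quoted evaluation: mean and standard error of cumulative ground-truth
# reward across 5 independent trials (NumPy only; the data layout is an assumption).
import numpy as np

def trial_return(trajectory_rewards):
    """Average over trajectories of the per-trajectory cumulative ground-truth reward."""
    return np.mean([np.sum(r) for r in trajectory_rewards])

def summarize(per_trial_returns):
    """Mean and standard error across trials (5 trials in the paper)."""
    x = np.asarray(per_trial_returns, dtype=float)
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

# Usage: summarize([trial_return(rewards_k) for rewards_k in all_trials])
# where all_trials holds the reward sequences collected in each of the 5 trials.
```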