Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality

Authors: Songyuan Zhang, Zhangjie Cao, Dorsa Sadigh, Yanan Sui

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide theoretical guarantees on the convergence of CAIL and evaluate its performance in both simulated and real robot experiments. Our results show that CAIL significantly outperforms other imitation learning methods from demonstrations with varying optimality."
Researcher Affiliation | Academia | Songyuan Zhang¹, Zhangjie Cao², Dorsa Sadigh², Yanan Sui¹; ¹National Engineering Lab for Neuromodulation, SAE, Tsinghua University, China; ²Department of Computer Science, Stanford University, USA. Contact: szhang21@mit.edu, {caozj,dorsa}@cs.stanford.edu, ysui@tsinghua.edu.cn
Pseudocode | No | The paper describes its optimization procedure with equations and textual descriptions but does not provide a formally labeled pseudocode block or algorithm. (An illustrative, hedged sketch of such a bi-level loop is given after the table.)
Open Source Code | No | The paper states "The code is available on our website," but the provided link (https://sites.google.com/view/cail) points to a general project website rather than directly to a source-code repository (e.g., GitHub, GitLab).
Open Datasets | Yes | "We conduct experiments in four environments including two MuJoCo environments (Reacher and Ant) [28] in OpenAI Gym [7], one Franka Panda Arm simulation environment, and one real robot environment with a UR5e robot arm." (A hedged sketch instantiating the two MuJoCo environments follows the table.)
Dataset Splits | Yes | "In our implementation, we use a limited amount of ranked demonstrations as our evaluation data for the outer loss... We label only 5% of the demonstrated trajectories with rankings since we target realistic settings where only a small number of rankings are available for the demonstrations." (A hypothetical splitting sketch follows the table.)
Hardware Specification | No | The paper mentions simulated environments (MuJoCo, Franka Panda Arm) and a real robot arm (UR5e), but does not specify any hardware used for computation (e.g., CPU or GPU models, or memory).
Software Dependencies | No | "For the RL algorithm, we use SAC [16] for the Reacher environment and PPO [24] for the Ant environment." The algorithms are named, but the specific software libraries and their version numbers are not provided.
Experiment Setup | Yes | "We collect 200 trajectories in total, where each trajectory has 50 interaction steps... We collect trajectories with 200,000 interaction steps in total... We label only 5% of the demonstrated trajectories with rankings... In all the experiments, we use ϵ = 10⁻⁵." (These values are gathered into an illustrative configuration sketch after the table.)
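
Since the paper contains no formally labeled algorithm block, the snippet below is a minimal, hypothetical sketch of the bi-level structure the entries above describe: an inner imitation step weighted by per-trajectory confidence and an outer step that fits the confidence scores to the small set of ranked trajectories. All names are invented, the inner objective is plain behavior cloning rather than the adversarial imitation objective CAIL builds on, and the outer loss is simplified to act directly on the confidence scores; it is an illustration of the idea, not the authors' algorithm.

```python
# Hypothetical, simplified sketch of a bi-level confidence-weighted imitation
# loop. "Policy", the behavior-cloning inner loss, and the pairwise outer loss
# are illustrative stand-ins, not the objectives used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)


def train(demos, ranked_pairs, obs_dim, act_dim, epochs=100):
    """demos: list of (obs, act) tensors, one entry per trajectory.
    ranked_pairs: (i, j) trajectory indices with trajectory i ranked above j."""
    policy = Policy(obs_dim, act_dim)
    conf_logits = torch.zeros(len(demos), requires_grad=True)  # one score per trajectory
    opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-3)
    opt_conf = torch.optim.Adam([conf_logits], lr=1e-2)

    for _ in range(epochs):
        # Inner step: confidence-weighted imitation (here: behavior cloning).
        conf = torch.sigmoid(conf_logits).detach()
        inner_loss = sum(conf[k] * F.mse_loss(policy(obs), act)
                         for k, (obs, act) in enumerate(demos)) / len(demos)
        opt_policy.zero_grad()
        inner_loss.backward()
        opt_policy.step()

        # Outer step: push confidence to respect the few available rankings
        # (higher-ranked trajectories should receive higher confidence).
        if ranked_pairs:
            conf = torch.sigmoid(conf_logits)
            outer_loss = sum(F.softplus(conf[j] - conf[i]) for i, j in ranked_pairs)
            opt_conf.zero_grad()
            outer_loss.backward()
            opt_conf.step()

    return policy, torch.sigmoid(conf_logits).detach()
```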
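
The two MuJoCo tasks can be instantiated through OpenAI Gym as sketched below. The environment IDs and the stable-baselines3 import are assumptions (the paper names the algorithms SAC and PPO but no library or version numbers), and the Franka Panda and UR5e environments are not covered here.

```python
# Assumed OpenAI Gym environment IDs for the two MuJoCo tasks; stable-baselines3
# is an assumed implementation of SAC/PPO, not a dependency named by the paper.
import gym
from stable_baselines3 import PPO, SAC

reacher_env = gym.make("Reacher-v2")  # MuJoCo Reacher (ID assumed)
ant_env = gym.make("Ant-v2")          # MuJoCo Ant (ID assumed)

sac_agent = SAC("MlpPolicy", reacher_env)  # SAC [16] used for Reacher
ppo_agent = PPO("MlpPolicy", ant_env)      # PPO [24] used for Ant
```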
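
A minimal sketch of the split described in the Dataset Splits entry, assuming each trajectory comes with an oracle return that can serve as the ranking key; the 5% figure is from the paper, while the function name, the pairwise-ranking format, and the use of returns are illustrative assumptions.

```python
import random


def split_with_rankings(trajectories, returns, ranked_fraction=0.05, seed=0):
    """Label a small fraction of trajectories with pairwise rankings, to serve
    as evaluation data for the outer loss; the remaining trajectories stay
    unranked. Assumes returns[i] is the (oracle) return of trajectory i."""
    rng = random.Random(seed)
    idx = list(range(len(trajectories)))
    rng.shuffle(idx)
    n_ranked = max(2, int(ranked_fraction * len(trajectories)))
    # Sort the sampled subset from best to worst by return.
    ranked_idx = sorted(idx[:n_ranked], key=lambda i: returns[i], reverse=True)
    # Consecutive pairs (i, j) mean trajectory i is ranked above trajectory j.
    ranked_pairs = [(ranked_idx[k], ranked_idx[k + 1])
                    for k in range(len(ranked_idx) - 1)]
    unranked_idx = idx[n_ranked:]
    return ranked_pairs, unranked_idx
```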
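
Finally, the quoted setup figures can be gathered into one place for reference. The field names below are invented, and assigning the 200-trajectory figure to Reacher and the 200,000-step figure to Ant follows the order of the quoted sentences, which is an assumption; the numeric values themselves are taken from the paper.

```python
# Values quoted from the paper's experiment setup; key names are illustrative,
# and mapping the figures to Reacher vs. Ant is an assumption.
EXPERIMENT_SETUP = {
    "reacher": {
        "num_trajectories": 200,      # "200 trajectories in total"
        "steps_per_trajectory": 50,   # "each trajectory has 50 interaction steps"
    },
    "ant": {
        "total_interaction_steps": 200_000,  # "200,000 interaction steps in total"
    },
    "ranked_fraction": 0.05,  # only 5% of trajectories labeled with rankings
    "epsilon": 1e-5,          # "In all the experiments, we use ϵ = 10^-5"
}
```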