Coherent Soft Imitation Learning

Authors: Joe Watson, Sandy Huang, Nicolas Heess

NeurIPS 2023

Reproducibility assessment. Each entry below gives the variable, its assessed result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "We evaluate CSIL against baseline methods on tabular and continuous state-action environments. The baselines are popular entropy-regularized imitation learning methods discussed in Section 2. Moreover, ablation studies are provided in Appendix N for the experiments in Sections 5.2 and 5.3." (Section 5, Experimental Results)
Researcher Affiliation: Collaboration
LLM Response: Joe Watson (TU Darmstadt, Darmstadt, Germany; Systems AI for Robot Learning, German Research Center for AI, dfki.de; joe@robot-learning.de); Sandy H. Huang and Nicolas Heess (Google DeepMind, London, United Kingdom; {shhuang,heess}@google.com)
Pseudocode: Yes
LLM Response: "Algorithm 1: Coherent soft imitation learning (CSIL)"
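To make the pseudocode concrete, here is a minimal sketch of the reward construction at the heart of Algorithm 1 as we read it: a behaviourally cloned policy defines a shaped reward through its log-ratio against a reference prior, under which the clone is already soft-optimal, and soft RL then refines it. The helper names (coherent_reward, bc_policy) and the Gaussian-policy/uniform-prior modelling choices are illustrative assumptions, not the authors' code.

    import numpy as np

    def gaussian_logpdf(a, mean, std):
        """Log-density of a diagonal Gaussian policy evaluated at action a."""
        return np.sum(-0.5 * ((a - mean) / std) ** 2
                      - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)

    def coherent_reward(obs, act, bc_policy, alpha=1.0, act_low=-1.0, act_high=1.0):
        """Shaped reward r(s, a) = alpha * (log pi_bc(a|s) - log pi_0(a|s))."""
        mean, std = bc_policy(obs)                  # cloned policy's Gaussian head
        log_pi_bc = gaussian_logpdf(act, mean, std)
        # Uniform reference prior over a bounded action box (an assumption here).
        log_pi_0 = -act.shape[-1] * np.log(act_high - act_low)
        return alpha * (log_pi_bc - log_pi_0)

    # Toy usage on a 1-D action space: demonstration-like actions score higher.
    bc_policy = lambda obs: (np.tanh(obs), np.full_like(obs, 0.1))
    obs = np.array([[0.3], [-0.7]])
    act = np.array([[0.29], [0.1]])
    print(coherent_reward(obs, act, bc_policy))

A SAC-style soft policy iteration learner, warm-started from the cloned policy, would then be run with this fixed reward.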
Open Source Code: Yes
LLM Response: "For the open-source implementation and simulation results, see joemwatson.github.io/csil."
Open Datasets: Yes
LLM Response: "A standard benchmark of deep imitation learning is learning MuJoCo [72] Gym [73] and Adroit [74] tasks from agent demonstrations."
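The paper does not name a specific dataset distribution in this quote, but as one common way to obtain agent demonstrations for these Gym and Adroit tasks, the D4RL suite would look like the sketch below (the task name, and the use of D4RL at all, are assumptions on our part, not a claim about the authors' pipeline).

    import gym
    import d4rl  # importing registers the offline MuJoCo and Adroit datasets

    # Hypothetical demonstration source; the paper may have generated its own.
    env = gym.make("hopper-expert-v2")      # MuJoCo Gym locomotion expert data
    data = env.get_dataset()                # dict of numpy arrays
    demos = {k: data[k] for k in ("observations", "actions", "terminals")}
    print({k: v.shape for k, v in demos.items()})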
Dataset Splits: No
LLM Response: The paper describes using various datasets (e.g., Gym, Adroit, robomimic) and discusses combining demonstration data with online/offline data sources. However, it does not provide specific numerical details (percentages or counts) for train, validation, or test dataset splits.
Hardware Specification: Yes
LLM Response: "Our learner (policy evaluation and improvement) runs on a single TPU v2. We ran four actors to interact with the environment. Depending on the algorithm, there were also one or more evaluators. For vision-based tasks, we used A100 GPUs for the policies."
Software Dependencies: No
LLM Response: The paper mentions using the "jax automatic differentiation and linear algebra library", "acme", and implementations based on "PyTorch" (in references), but it does not specify concrete version numbers for these software components (e.g., JAX version X.Y, Acme version A.B) to ensure reproducibility.
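When reproducing a setup like this, one lightweight mitigation for the missing versions is to log the installed ones at run time. A small sketch (dm-acme is the PyPI distribution name for acme; the package list itself is our assumption):

    from importlib.metadata import version

    # Record the versions actually installed, since the paper does not pin any.
    for pkg in ("jax", "dm-acme"):
        print(pkg, version(pkg))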
Experiment Setup: Yes
LLM Response: "The policy and critic networks consisted of two layers with 256 units and ELU activations. Learning rates were 3e-4, the batch size was 256, and the target network smoothing coefficient was 0.005."
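These details are enough to reconstruct the network skeleton. A minimal sketch in Flax/Optax, assuming they sit on top of the paper's JAX stack (the module code itself is ours, not the authors'):

    import flax.linen as nn
    import optax

    class MLP(nn.Module):
        """Two hidden layers of 256 units with ELU activations, as reported."""
        out_dim: int

        @nn.compact
        def __call__(self, x):
            x = nn.elu(nn.Dense(256)(x))
            x = nn.elu(nn.Dense(256)(x))
            return nn.Dense(self.out_dim)(x)

    optimizer = optax.adam(3e-4)   # reported learning rate
    batch_size = 256               # reported batch size
    tau = 0.005                    # target-network smoothing coefficient
    # The Polyak target update inside the critic step would then be, e.g.:
    # target_params = optax.incremental_update(online_params, target_params, tau)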