Coherent Soft Imitation Learning
Authors: Joe Watson, Sandy Huang, Nicolas Heess
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experimental results): "We evaluate CSIL against baseline methods on tabular and continuous state-action environments. The baselines are popular entropy-regularized imitation learning methods discussed in Section 2. Moreover, ablation studies are provided in Appendix N for the experiments in Sections 5.2 and 5.3." |
| Researcher Affiliation | Collaboration | Joe Watson (TU Darmstadt, Darmstadt, Germany, joe@robot-learning.de; Systems AI for Robot Learning, German Research Center for AI, dfki.de); Sandy H. Huang and Nicolas Heess (Google DeepMind, London, United Kingdom, {shhuang,heess}@google.com) |
| Pseudocode | Yes | Algorithm 1: Coherent soft imitation learning (CSIL) |
| Open Source Code | Yes | For the open-source implementation and simulation results, see joemwatson.github.io/csil. |
| Open Datasets | Yes | "A standard benchmark of deep imitation learning is learning MuJoCo [72] Gym [73] and Adroit [74] tasks from agent demonstrations." |
| Dataset Splits | No | The paper describes using various datasets (e.g., Gym, Adroit, robomimic) and discusses combining demonstration data with online/offline data sources. However, it does not provide specific numerical details (percentages or counts) for train, validation, or test dataset splits. |
| Hardware Specification | Yes | "Our learner (policy evaluation and improvement) runs on a single TPU v2. We ran four actors to interact with the environment. Depending on the algorithm, there were also one or more evaluators." For vision-based tasks, the policies ran on A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'jax automatic differentiation and linear algebra library', 'acme', and implementations based on 'PyTorch' (in references), but it does not pin version numbers for these components (e.g., JAX version X.Y, Acme version A.B), which limits reproducibility. |
| Experiment Setup | Yes | The policy and critic networks were comprised of two layers with 256 units and ELU activations. Learning rates were 3e-4, the batch size was 256, and the target network smoothing coefficient was 0.005. |
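The reported hyperparameters (two hidden layers of 256 ELU units, learning rate 3e-4, batch size 256, target smoothing coefficient 0.005) can be collected into a minimal sketch. This is an illustrative reconstruction, not the authors' code: the config names, the `elu` helper, and the `polyak_update` function are assumptions made for clarity.

```python
import numpy as np

# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative, not taken from the authors' implementation.
CONFIG = {
    "hidden_layers": (256, 256),  # two layers of 256 units each
    "activation": "elu",
    "learning_rate": 3e-4,
    "batch_size": 256,
    "target_smoothing": 0.005,    # Polyak coefficient for target networks
}

def elu(x, alpha=1.0):
    """ELU activation used in the policy and critic networks."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def polyak_update(target_params, online_params, tau=CONFIG["target_smoothing"]):
    """Soft target-network update: target <- (1 - tau) * target + tau * online.

    With tau = 0.005 the target network tracks the online network slowly,
    which is the standard stabilisation trick in soft actor-critic-style
    methods like CSIL's policy evaluation step.
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

The smoothing coefficient of 0.005 means each gradient step moves the target parameters only 0.5% of the way toward the online parameters.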