Imitation Learning via Off-Policy Distribution Matching
Authors: Ilya Kostrikov, Ofir Nachum, Jonathan Tompson
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ValueDICE on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance...We evaluate ValueDICE in a variety of settings, starting with a simple synthetic task before continuing to an evaluation on a suite of MuJoCo benchmarks. |
| Researcher Affiliation | Collaboration | Ilya Kostrikov, Ofir Nachum, Jonathan Tompson, Google Research {kostrikov, ofirnachum, tompson}@google.com...Also at NYU. |
| Pseudocode | Yes | Please see the appendix for a full pseudocode implementation of Value DICE. |
| Open Source Code | Yes | Code to reproduce our results is available at https://github.com/google-research/google-research/tree/master/value_dice. |
| Open Datasets | Yes | We evaluate the algorithms on the standard MuJoCo environments using expert demonstrations from Ho & Ermon (2016). |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits, but rather describes using expert demonstrations for learning and evaluating policies in a simulated environment. |
| Hardware Specification | No | The paper mentions 'networks with an MLP architecture' but provides no specific details about the hardware (e.g., CPU, GPU models, memory) used for experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and specific regularization techniques, but it does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | All algorithms use networks with an MLP architecture with 2 hidden layers and 256 hidden units. For the discriminator, critic, and ν networks we use the Adam optimizer with learning rate 10⁻³, while for the actor we use a learning rate of 10⁻⁵. For the discriminator and ν networks we use gradient penalties from Gulrajani et al. (2017). We also regularize the actor network with orthogonal regularization (Brock et al., 2018) with a coefficient of 10⁻⁴, and we perform 4 updates per environment step. We handle absorbing states of the environments similarly to Kostrikov et al. (2019). (A hedged configuration sketch of these settings follows the table.) |
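
The reported experiment setup maps directly onto a small training configuration. The sketch below collects those hyperparameters, assuming TensorFlow 2 / Keras; it is not the authors' released code (linked in the table above), and the class and function names, the ReLU activations, and the example output dimension are illustrative assumptions. The gradient-penalty and orthogonal-regularization terms are noted in comments rather than implemented.

```python
# A minimal, hypothetical sketch of the reported setup; assumes TensorFlow 2.
# Names, ReLU activations, and the example dimensions are illustrative
# assumptions, not taken from the authors' released implementation.
from dataclasses import dataclass

import tensorflow as tf


@dataclass
class ReportedSetup:
    hidden_layers: int = 2              # "2 hidden layers"
    hidden_units: int = 256             # "256 hidden units"
    nu_lr: float = 1e-3                 # Adam lr for discriminator / critic / nu networks
    actor_lr: float = 1e-5              # Adam lr for the actor
    orthogonal_reg_coef: float = 1e-4   # orthogonal regularization (Brock et al., 2018)
    updates_per_env_step: int = 4       # "4 updates per environment step"


def build_mlp(output_dim: int, cfg: ReportedSetup) -> tf.keras.Model:
    """2-hidden-layer, 256-unit MLP as described in the setup (activation assumed)."""
    model = tf.keras.Sequential()
    for _ in range(cfg.hidden_layers):
        model.add(tf.keras.layers.Dense(cfg.hidden_units, activation="relu"))
    model.add(tf.keras.layers.Dense(output_dim))
    return model


cfg = ReportedSetup()
# Output dimensions are examples only (e.g. a MuJoCo task with 6-dim actions).
nu_net = build_mlp(output_dim=1, cfg=cfg)     # nu(s, a) -> scalar
actor_net = build_mlp(output_dim=6, cfg=cfg)  # policy head (distribution params omitted)
# Gradient penalties (Gulrajani et al., 2017) on nu and the orthogonal
# regularizer on the actor would be added to the respective losses here.
nu_opt = tf.keras.optimizers.Adam(learning_rate=cfg.nu_lr)
actor_opt = tf.keras.optimizers.Adam(learning_rate=cfg.actor_lr)
```

Note the asymmetric learning rates (10⁻³ for the ν/critic networks versus 10⁻⁵ for the actor); this is the detail most easily lost to the PDF extraction, hence the explicit constants above.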