Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Imitation Learning via Off-Policy Distribution Matching
Authors: Ilya Kostrikov, Ofir Nachum, Jonathan Tompson
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ValueDICE on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance. ... We evaluate ValueDICE in a variety of settings, starting with a simple synthetic task before continuing to an evaluation on a suite of MuJoCo benchmarks. |
| Researcher Affiliation | Collaboration | Ilya Kostrikov, Ofir Nachum, Jonathan Tompson Google Research EMAIL ... Also at NYU. |
| Pseudocode | Yes | Please see the appendix for a full pseudocode implementation of ValueDICE. |
| Open Source Code | Yes | Code to reproduce our results is available at https://github.com/google-research/google-research/tree/master/value_dice. |
| Open Datasets | Yes | We evaluate the algorithms on the standard MuJoCo environments using expert demonstrations from Ho & Ermon (2016). |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits, but rather describes using expert demonstrations for learning and evaluating policies in a simulated environment. |
| Hardware Specification | No | The paper mentions 'networks with an MLP architecture' but provides no specific details about the hardware (e.g., CPU, GPU models, memory) used for experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and specific regularization techniques, but it does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | All algorithms use networks with an MLP architecture with 2 hidden layers and 256 hidden units. For discriminators, critic, ν we use Adam optimizer with learning rate 10⁻³ while for the actors we use the learning rate of 10⁻⁵. For the discriminator and ν networks we use gradient penalties from Gulrajani et al. (2017). We also regularize the actor network with the orthogonal regularization (Brock et al., 2018) with a coefficient 10⁻⁴. Also we perform 4 updates per 1 environment step. We handle absorbing states of the environments similarly to Kostrikov et al. (2019). |
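
The settings quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and helper functions are hypothetical, and the learning rates and regularization coefficient assume the exponents in the quoted text are negative (10⁻³, 10⁻⁵, 10⁻⁴). Gradient penalties and absorbing-state handling are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ValueDiceSetup:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    hidden_layers: int = 2              # MLP hidden layers
    hidden_units: int = 256             # units per hidden layer
    critic_lr: float = 1e-3             # Adam, for discriminators, critic, and nu
    actor_lr: float = 1e-5              # Adam, for the actors
    orthogonal_reg_coef: float = 1e-4   # actor regularization coefficient
    updates_per_env_step: int = 4       # gradient updates per environment step


def mlp_layer_sizes(input_dim: int, output_dim: int, setup: ValueDiceSetup) -> List[int]:
    """Return the layer widths of an MLP with the configured hidden layers."""
    return [input_dim] + [setup.hidden_units] * setup.hidden_layers + [output_dim]


def total_gradient_updates(env_steps: int, setup: ValueDiceSetup) -> int:
    """Number of gradient updates performed over a given number of environment steps."""
    return env_steps * setup.updates_per_env_step
```

For example, with an illustrative 17-dimensional observation and 6-dimensional action, `mlp_layer_sizes(17, 6, ValueDiceSetup())` gives the widths of a two-hidden-layer network, and `total_gradient_updates(1000, ValueDiceSetup())` reflects the 4-to-1 update ratio.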