Imitation Learning via Off-Policy Distribution Matching

Authors: Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ValueDICE on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance. ... We evaluate ValueDICE in a variety of settings, starting with a simple synthetic task before continuing to an evaluation on a suite of MuJoCo benchmarks.
Researcher Affiliation | Collaboration | Ilya Kostrikov, Ofir Nachum, Jonathan Tompson, Google Research, {kostrikov, ofirnachum, tompson}@google.com ... Also at NYU.
Pseudocode | Yes | Please see the appendix for a full pseudocode implementation of ValueDICE.
Open Source Code | Yes | Code to reproduce our results is available at https://github.com/google-research/google-research/tree/master/value_dice.
Open Datasets | Yes | We evaluate the algorithms on the standard MuJoCo environments using expert demonstrations from Ho & Ermon (2016).
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits, but rather describes using expert demonstrations for learning and evaluating policies in a simulated environment.
Hardware Specification | No | The paper mentions 'networks with an MLP architecture' but provides no specific details about the hardware (e.g., CPU, GPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' and specific regularization techniques, but it does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | All algorithms use networks with an MLP architecture with 2 hidden layers and 256 hidden units. For the discriminators, critic, and ν networks we use the Adam optimizer with a learning rate of 10^-3, while for the actors we use a learning rate of 10^-5. For the discriminator and ν networks we use gradient penalties from Gulrajani et al. (2017). We also regularize the actor network with orthogonal regularization (Brock et al., 2018) with a coefficient of 10^-4. We perform 4 updates per environment step. We handle absorbing states of the environments similarly to Kostrikov et al. (2019).
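
For reference, the reported experiment setup can be expressed as a short configuration sketch. The snippet below is a minimal illustration, not the authors' released implementation (which is linked above); it assumes TensorFlow/Keras, ReLU activations, a Gaussian policy head, and example MuJoCo dimensions, none of which are specified in the quoted text beyond the layer sizes, learning rates, regularization coefficient, and update frequency.

    import tensorflow as tf

    def make_mlp(input_dim, output_dim, hidden_units=256, num_hidden_layers=2):
        # 2 hidden layers x 256 units, matching the reported architecture.
        # ReLU activations are an assumption, not stated in the quoted setup.
        model = tf.keras.Sequential()
        model.add(tf.keras.Input(shape=(input_dim,)))
        for _ in range(num_hidden_layers):
            model.add(tf.keras.layers.Dense(hidden_units, activation="relu"))
        model.add(tf.keras.layers.Dense(output_dim))
        return model

    def orthogonal_regularization(model, coeff=1e-4):
        # Orthogonal regularization (Brock et al., 2018) on the dense kernels,
        # with the 1e-4 coefficient reported for the actor network.
        reg = 0.0
        for w in model.trainable_variables:
            if "kernel" in w.name:
                gram = tf.matmul(w, w, transpose_a=True)
                off_diag = gram * (1.0 - tf.eye(gram.shape[0]))
                reg += tf.reduce_sum(tf.square(off_diag))
        return coeff * reg

    # Illustrative dimensions only (e.g. a 17-dim observation, 6-dim action
    # MuJoCo task); the real values depend on the environment.
    obs_dim, act_dim = 17, 6

    nu_net = make_mlp(obs_dim + act_dim, 1)     # ν / critic network
    actor_net = make_mlp(obs_dim, 2 * act_dim)  # e.g. Gaussian mean and log-std

    # Learning rates from the reported setup: 1e-3 for ν/critic, 1e-5 for the actor.
    nu_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    actor_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

    # The paper reports 4 gradient updates per environment step.
    UPDATES_PER_ENV_STEP = 4

The gradient-penalty term from Gulrajani et al. (2017) and the handling of absorbing states from Kostrikov et al. (2019) are omitted here; they would be additional loss terms and environment-wrapping logic on top of this configuration.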