End-to-End Differentiable Adversarial Imitation Learning

Authors: Nir Baram, Oron Anschel, Itai Caspi, Shie Mannor

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test it on both discrete and continuous action domains and report results that surpass the state-of-the-art. We evaluate the proposed algorithm on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid) modeled by the MuJoCo physics simulator (Todorov et al., 2012).
Researcher Affiliation | Academia | Nir Baram 1, Oron Anschel 1, Itai Caspi 1, Shie Mannor 1 (1 Technion Institute of Technology, Israel).
Pseudocode | Yes | Algorithm 1: Model-based Generative Adversarial Imitation Learning (a hedged sketch of this training loop follows the table).
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the proposed methodology.
Open Datasets | Yes | We evaluate the proposed algorithm on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid) modeled by the MuJoCo physics simulator (Todorov et al., 2012).
Dataset Splits | No | The paper describes generating trajectories but does not specify explicit train/validation/test dataset splits for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments.
Software Dependencies | No | The paper mentions using TRPO, MuJoCo, and the ADAM optimizer, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The discriminator and policy neural networks are built from two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer (Kingma & Ba, 2014). For each task, we produce datasets with a different number of trajectories, where each trajectory τ = {s_0, a_0, s_1, a_1, ..., s_N, a_N} is of length N = 1000. We found empirically that using a Hadamard product to combine the encoded state and action achieves the best performance. Additionally, predicting the next state based on the current state alone requires the environment to be representable as a first-order MDP. Instead, we can assume the environment to be representable as an n-th order MDP and use multiple previous states to predict the next state. To model the multi-step dependencies, we use a recurrent connection from the previous state by incorporating a GRU layer (Cho et al., 2014) as part of the state encoder. (A hedged sketch of this architecture follows the table.)
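
The Experiment Setup row fixes the building blocks (two hidden ReLU layers, the ADAM optimizer, a GRU-based state encoder, and a Hadamard product combining the encoded state and action) but not the layer widths or the exact wiring. The PyTorch sketch below shows one way to realize that description; the hidden sizes, the GRUCell placement, the decoder shape, and the Hopper-like dimensions are assumptions for illustration, not the authors' released architecture.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Two hidden layers with ReLU, as described for the discriminator and policy networks.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class ForwardModel(nn.Module):
    # GRU-based state encoder plus an action encoder, combined by a Hadamard product.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.state_enc = nn.Linear(state_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)          # recurrent connection over previous states
        self.action_enc = nn.Linear(action_dim, hidden)
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, state_dim))

    def forward(self, state, action, h):
        h = self.gru(torch.relu(self.state_enc(state)), h)   # encoded state, with memory of past states
        z = h * torch.relu(self.action_enc(action))          # Hadamard (element-wise) product
        return self.decoder(z), h                            # predicted next state, new hidden state

# All networks are trained with ADAM, as in the quoted setup; sizes and learning rate are assumed.
model = ForwardModel(state_dim=11, action_dim=3)             # e.g. Hopper-like dimensions (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

A rollout would call model(state, action, h) step by step, carrying h forward so that the prediction can depend on more than the current state, which is the n-th order MDP argument in the quoted setup.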
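
The Pseudocode row points to Algorithm 1 (model-based GAIL), but no code is released, so the following is only a minimal sketch of that style of loop under stated assumptions: a forward model is fit to observed transitions, a discriminator is trained to separate expert from policy state-action pairs, and the policy is updated by backpropagating the discriminator signal through the differentiable forward model. The dimensions, learning rates, unroll length, label convention, and the random placeholder tensors standing in for expert demonstrations and environment rollouts are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

S, A, B, T = 4, 2, 64, 10        # assumed state dim, action dim, batch size, unroll length
policy = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, A), nn.Tanh())
disc = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
fwd = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, S))
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
f_opt = torch.optim.Adam(fwd.parameters(), lr=1e-3)

for it in range(1000):
    # Placeholder tensors; real code would load expert demonstrations and environment rollouts.
    exp_s, exp_a = torch.randn(B, S), torch.randn(B, A)
    env_s, env_a, env_s1 = torch.randn(B, S), torch.randn(B, A), torch.randn(B, S)

    # 1) Fit the differentiable forward model to observed transitions (supervised regression).
    f_loss = F.mse_loss(fwd(torch.cat([env_s, env_a], 1)), env_s1)
    f_opt.zero_grad(); f_loss.backward(); f_opt.step()

    # 2) Train the discriminator: label 1 for policy pairs, 0 for expert pairs (one convention).
    pol_s = torch.randn(B, S)
    pol_a = policy(pol_s).detach()
    d_pol = disc(torch.cat([pol_s, pol_a], 1))
    d_exp = disc(torch.cat([exp_s, exp_a], 1))
    d_loss = (F.binary_cross_entropy_with_logits(d_pol, torch.ones_like(d_pol))
              + F.binary_cross_entropy_with_logits(d_exp, torch.zeros_like(d_exp)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 3) Update the policy by unrolling through the learned model and backpropagating the
    #    discriminator's output end-to-end, rather than using a score-function estimator.
    s = torch.randn(B, S)
    cost = torch.zeros(())
    for t in range(T):
        a = policy(s)
        cost = cost + F.logsigmoid(disc(torch.cat([s, a], 1))).mean()  # policy tries to look expert-like
        s = fwd(torch.cat([s, a], 1))                                  # gradients flow through the model
    p_opt.zero_grad(); cost.backward(); p_opt.step()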