End-to-End Differentiable Adversarial Imitation Learning
Authors: Nir Baram, Oron Anschel, Itai Caspi, Shie Mannor
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test it on both discrete and continuous action domains and report results that surpass the state-of-the-art. We evaluate the proposed algorithm on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid) modeled by the MuJoCo physics simulator (Todorov et al., 2012). |
| Researcher Affiliation | Academia | Nir Baram, Oron Anschel, Itai Caspi, Shie Mannor; Technion Institute of Technology, Israel. |
| Pseudocode | Yes | Algorithm 1 Model-based Generative Adversarial Imitation Learning |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the proposed methodology. |
| Open Datasets | Yes | We evaluate the proposed algorithm on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid) modeled by the MuJoCo physics simulator (Todorov et al., 2012). |
| Dataset Splits | No | The paper describes generating trajectories but does not specify explicit train/validation/test dataset splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments. |
| Software Dependencies | No | The paper mentions using TRPO, MuJoCo, and ADAM optimizer, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The discriminator and policy neural networks are built from two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer (Kingma & Ba, 2014). For each task, we produce datasets with a different number of trajectories, where each trajectory τ = {s_0, a_0, s_1, a_1, ..., s_N, a_N} is of length N = 1000. We found empirically that using a Hadamard product to combine the encoded state and action achieves the best performance. Additionally, predicting the next state based on the current state alone requires the environment to be representable as a first-order MDP. Instead, we can assume the environment to be representable as an nth-order MDP and use multiple previous states to predict the next state. To model the multi-step dependencies, we use a recurrent connection from the previous state by incorporating a GRU layer (Cho et al., 2014) as part of the state encoder. (A hedged architecture sketch based on this description follows the table.) |
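
The experiment-setup row above describes the networks only in prose. Below is a minimal PyTorch sketch of that description, not the authors' implementation: a GRU state encoder, an action encoder, a Hadamard (element-wise) product combining the two embeddings for next-state prediction, and a two-hidden-layer ReLU discriminator trained with Adam. All layer widths, learning rates, and tensor shapes are illustrative assumptions that are not reported in the quoted excerpt.

```python
# Minimal sketch of the forward-model and discriminator architecture described in the
# paper's experiment setup. Layer sizes and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class ForwardModel(nn.Module):
    """Predicts the next state from a history of states and the current action.

    A GRU encodes the state history (nth-order MDP assumption); the action is
    encoded separately and combined with the state embedding by a Hadamard product.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.state_encoder = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.action_encoder = nn.Sequential(
            nn.Linear(action_dim, hidden_dim), nn.ReLU()
        )
        # Decoder maps the combined embedding back to a predicted next state.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state_history, action):
        # state_history: (batch, time, state_dim); action: (batch, action_dim)
        _, h = self.state_encoder(state_history)        # h: (1, batch, hidden_dim)
        state_embedding = h.squeeze(0)                  # (batch, hidden_dim)
        action_embedding = self.action_encoder(action)  # (batch, hidden_dim)
        combined = state_embedding * action_embedding   # Hadamard product
        return self.decoder(combined)


class Discriminator(nn.Module):
    """Two-hidden-layer ReLU network scoring (state, action) pairs."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


if __name__ == "__main__":
    # Dimensions below are arbitrary placeholders, not values from the paper.
    model = ForwardModel(state_dim=11, action_dim=3)
    disc = Discriminator(state_dim=11, action_dim=3)
    # Adam is the optimizer named in the paper; the learning rate here is a guess.
    opt = torch.optim.Adam(list(model.parameters()) + list(disc.parameters()), lr=1e-4)
    states = torch.randn(8, 4, 11)   # batch of 8, history of 4 states
    actions = torch.randn(8, 3)
    next_state_pred = model(states, actions)
    print(next_state_pred.shape)     # torch.Size([8, 11])
```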