State-only Imitation with Transition Dynamics Mismatch
Authors: Tanmay Gangwani, Jian Peng
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the efficacy of our algorithm with continuous-control locomotion tasks from MuJoCo. Figure 1a depicts one example of the dynamics mismatch which we evaluate in our experiments. For the Ant agent, an expert walking policy π_e is trained under the default dynamics provided in the OpenAI Gym, T_exp = Earth. The dynamics under which to learn the imitator policy are created by modifying the gravity parameter to half its default value (i.e. 9.81/2), T_pol = Planet X. Figure 1b plots the average episodic returns of π_e in the original and modified environments, showing that direct policy transfer is infeasible. For Figure 1c, we assume access to state-only expert demonstrations from π_e and perform IL with the GAIL algorithm. GAIL performs well if the imitator policy is learned in the same environment as the expert (T_exp = T_pol = Earth), but does not succeed under mismatched transition dynamics (T_exp = Earth, T_pol = Planet X). In our experiments section, we consider other sources of dynamics mismatch as well, such as agent density and joint friction. We show that I2L trains much better policies than baseline IL algorithms in these tasks, leading to successful transfer of expert skills to an imitator in an environment dissimilar to the expert. (A hedged sketch of constructing the modified-gravity environment follows the table.) |
| Researcher Affiliation | Academia | Tanmay Gangwani, Department of Computer Science, University of Illinois, Urbana-Champaign (gangwan2@illinois.edu); Jian Peng, Department of Computer Science, University of Illinois, Urbana-Champaign (jianpeng@illinois.edu) |
| Pseudocode | Yes | Algorithm 1: Indirect Imitation Learning (I2L) |
| Open Source Code | Yes | Code for this paper is available at https://github.com/tgangwani/RL-Indirect-imitation |
| Open Datasets | Yes | We test the efficacy of our algorithm with continuous-control locomotion tasks from MuJoCo. Figure 1a depicts one example of the dynamics mismatch which we evaluate in our experiments. For the Ant agent, an expert walking policy π_e is trained under the default dynamics provided in the OpenAI Gym, T_exp = Earth. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits in terms of percentages, sample counts, or specific predefined splits. It discusses training policies in environments and using expert demonstrations, but not data partitioning for model training/evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms and frameworks like PPO, AIRL, WGANs, MuJoCo, and OpenAI Gym, but it does not specify version numbers for these or any other software dependencies required for reproducibility. |
| Experiment Setup | Yes | Wasserstein critic φ network: 3 layers, 64 hidden units, tanh; Discriminator ω network: 3 layers, 64 hidden units, tanh; Policy θ network: 3 layers, 64 hidden units, tanh; Wasserstein critic φ optimizer, lr, gradient-steps: RMSprop, 5e-5, 20; Discriminator ω optimizer, lr, gradient-steps: Adam, 3e-4, 5; Policy θ algorithm, lr: PPO (clipped ratio), 1e-4; Number of state-only expert demonstrations: 1 (1000 states); Buffer B capacity: 5 trajectories; γ, λ (GAE): 0.99, 0.95. (A hedged sketch of these network and optimizer settings follows the table.) |
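
The Research Type row describes the dynamics-mismatch setup: the expert is trained under default Gym dynamics (T_exp = Earth) and the imitator under halved gravity (T_pol = Planet X). The sketch below shows one way to construct such a modified-gravity Ant environment. It is not the authors' code; it assumes the older mujoco-py-backed Gym API in which `env.unwrapped.model.opt.gravity` is a writable array, and the helper name `make_ant_env` is illustrative.

```python
# Minimal sketch (assumption: mujoco-py-backed Gym, writable model.opt.gravity)
# of the "Earth" vs. "Planet X" environments described for Figure 1a.
import gym
import numpy as np

def make_ant_env(gravity_scale=1.0):
    """Return an Ant-v2 environment with gravity scaled by `gravity_scale`."""
    env = gym.make("Ant-v2")
    model = env.unwrapped.model
    # Default gravity is (0, 0, -9.81); scale the z-component only.
    model.opt.gravity[:] = np.array([0.0, 0.0, -9.81 * gravity_scale])
    return env

# T_exp = Earth: default dynamics, used to train the expert policy pi_e.
expert_env = make_ant_env(gravity_scale=1.0)
# T_pol = Planet X: gravity halved, used to train the imitator policy.
imitator_env = make_ant_env(gravity_scale=0.5)
```

The same pattern would apply to the other mismatch sources mentioned in the paper (agent density, joint friction) by editing the corresponding MuJoCo model fields instead of gravity.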
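
The Experiment Setup row lists the network and optimizer hyper-parameters. Below is a minimal PyTorch sketch of that configuration, assuming 3-layer tanh MLPs with 64 hidden units for the critic, discriminator, and policy. The input/output dimensions and the use of Adam for the PPO policy update are illustrative assumptions not stated in the table.

```python
# Minimal sketch (an assumed PyTorch layout, not the released code) of the
# listed hyper-parameters: 3-layer tanh MLPs with 64 hidden units, RMSprop
# (5e-5) for the Wasserstein critic, Adam (3e-4) for the discriminator.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    """3-layer MLP with tanh activations, as specified in the setup table."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 111, 8  # e.g. Ant-v2 observation/action sizes (assumed)

critic = mlp(state_dim, 1)                       # Wasserstein critic phi (state-only)
discriminator = mlp(state_dim + action_dim, 1)   # discriminator omega (inputs assumed)
policy = mlp(state_dim, action_dim)              # policy theta mean network (PPO)

critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)  # Adam assumed for PPO
```

Per the table, each outer iteration would take 20 gradient steps on the critic and 5 on the discriminator, with GAE parameters γ = 0.99 and λ = 0.95 for the PPO update.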