State-only Imitation with Transition Dynamics Mismatch
Authors: Tanmay Gangwani, Jian Peng
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the efficacy of our algorithm with continuous-control locomotion tasks from MuJoCo. Figure 1a depicts one example of the dynamics mismatch which we evaluate in our experiments. For the Ant agent, an expert walking policy π_e is trained under the default dynamics provided in the OpenAI Gym, T_exp = Earth. The dynamics under which to learn the imitator policy are created by modifying the gravity parameter to half its default value (i.e. 9.81/2), T_pol = Planet X. Figure 1b plots the average episodic returns of π_e in the original and modified environments, showing that direct policy transfer is infeasible. For Figure 1c, we assume access to state-only expert demonstrations from π_e and perform IL with the GAIL algorithm. GAIL performs well if the imitator policy is learned in the same environment as the expert (T_exp = T_pol = Earth), but does not succeed under mismatched transition dynamics (T_exp = Earth, T_pol = Planet X). In our experiments section, we consider other sources of dynamics mismatch as well, such as agent density and joint friction. We show that I2L trains much better policies than baseline IL algorithms in these tasks, leading to successful transfer of expert skills to an imitator in an environment dissimilar to the expert. (A hedged sketch of constructing the modified-gravity environment follows the table.) |
| Researcher Affiliation | Academia | Tanmay Gangwani, Department of Computer Science, University of Illinois, Urbana-Champaign (gangwan2@illinois.edu); Jian Peng, Department of Computer Science, University of Illinois, Urbana-Champaign (jianpeng@illinois.edu) |
| Pseudocode | Yes | Algorithm 1: Indirect Imitation Learning (I2L) |
| Open Source Code | Yes | Code for this paper is available at https://github.com/tgangwani/RL-Indirect-imitation |
| Open Datasets | Yes | We test the efficacy of our algorithm with continuous-control locomotion tasks from MuJoCo. Figure 1a depicts one example of the dynamics mismatch which we evaluate in our experiments. For the Ant agent, an expert walking policy π_e is trained under the default dynamics provided in the OpenAI Gym, T_exp = Earth. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits in terms of percentages, sample counts, or specific predefined splits. It discusses training policies in environments and using expert demonstrations, but not data partitioning for model training/evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms and frameworks like PPO, AIRL, WGANs, MuJoCo, and OpenAI Gym, but it does not specify version numbers for these or any other software dependencies required for reproducibility. |
| Experiment Setup | Yes | Wasserstein critic φ network: 3 layers, 64 hidden units, tanh; Discriminator ω network: 3 layers, 64 hidden units, tanh; Policy θ network: 3 layers, 64 hidden units, tanh; Wasserstein critic φ optimizer, lr, gradient-steps: RMSprop, 5e-5, 20; Discriminator ω optimizer, lr, gradient-steps: Adam, 3e-4, 5; Policy θ algorithm, lr: PPO (clipped ratio), 1e-4; Number of state-only expert demonstrations: 1 (1000 states); Buffer B capacity: 5 trajectories; γ, λ (GAE): 0.99, 0.95. (A hedged sketch of these network and optimizer settings follows the table.) |
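
The Research Type row describes the dynamics-mismatch setup: the expert is trained under default Gym dynamics (T_exp = Earth) and the imitator under halved gravity (T_pol = Planet X). The sketch below shows one way to construct such a modified-gravity Ant environment. It is not the authors' code; it assumes the older mujoco-py-backed Gym API in which `env.unwrapped.model.opt.gravity` is a writable array, and the helper name `make_ant_env` is illustrative.

```python
# Minimal sketch (assumption: mujoco-py-backed Gym, writable model.opt.gravity)
# of the "Earth" vs. "Planet X" environments described for Figure 1a.
import gym
import numpy as np

def make_ant_env(gravity_scale=1.0):
    """Return an Ant-v2 environment with gravity scaled by `gravity_scale`."""
    env = gym.make("Ant-v2")
    model = env.unwrapped.model
    # Default gravity is (0, 0, -9.81); scale the z-component only.
    model.opt.gravity[:] = np.array([0.0, 0.0, -9.81 * gravity_scale])
    return env

# T_exp = Earth: default dynamics, used to train the expert policy pi_e.
expert_env = make_ant_env(gravity_scale=1.0)
# T_pol = Planet X: gravity halved, used to train the imitator policy.
imitator_env = make_ant_env(gravity_scale=0.5)
```

The same pattern would apply to the other mismatch sources mentioned in the paper (agent density, joint friction) by editing the corresponding MuJoCo model fields instead of gravity.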
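
The Experiment Setup row lists the network and optimizer hyper-parameters. Below is a minimal PyTorch sketch of that configuration, assuming 3-layer tanh MLPs with 64 hidden units for the critic, discriminator, and policy. The input/output dimensions and the use of Adam for the PPO policy update are illustrative assumptions not stated in the table.

```python
# Minimal sketch (an assumed PyTorch layout, not the released code) of the
# listed hyper-parameters: 3-layer tanh MLPs with 64 hidden units, RMSprop
# (5e-5) for the Wasserstein critic, Adam (3e-4) for the discriminator.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    """3-layer MLP with tanh activations, as specified in the setup table."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 111, 8  # e.g. Ant-v2 observation/action sizes (assumed)

critic = mlp(state_dim, 1)                       # Wasserstein critic phi (state-only)
discriminator = mlp(state_dim + action_dim, 1)   # discriminator omega (inputs assumed)
policy = mlp(state_dim, action_dim)              # policy theta mean network (PPO)

critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)  # Adam assumed for PPO
```

Per the table, each outer iteration would take 20 gradient steps on the critic and 5 on the discriminator, with GAE parameters γ = 0.99 and λ = 0.95 for the PPO update.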