MobILE: Model-Based Imitation Learning From Observation Alone

Authors: Rahul Kidambi, Jonathan Chang, Wen Sun

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.
Researcher Affiliation | Collaboration | Rahul Kidambi, Amazon Search & AI, Berkeley, CA 94704, rk773@cornell.edu; Jonathan D. Chang, CS Department, Cornell University, Ithaca, NY 14853, jdc396@cornell.edu; Wen Sun, CS Department, Cornell University, Ithaca, NY 14853, ws455@cornell.edu. Work initiated when RK was a post-doc at Cornell University; work done outside Amazon.
Pseudocode | Yes | Algorithm 1 MobILE: The framework of Model-based Imitation Learning and Exploring for ILFO (a hedged sketch of this loop appears after the table).
Open Source Code | Yes | Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.
Open Datasets | Yes | We consider tasks from OpenAI Gym [8] simulated with MuJoCo [62]: Cartpole-v1, Reacher-v2, Swimmer-v2, Hopper-v2, and Walker2d-v2.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It refers to 'expert trajectories' and 'online samples' without specifying how they are divided for training, validation, or testing.
Hardware Specification | No | The paper mentions MuJoCo for simulation but does not specify any hardware details, such as GPU/CPU models or memory, used for running the experiments.
Software Dependencies | No | The paper mentions components such as OpenAI Gym, MuJoCo, TRPO, SGD, MLPs, MMD, and the RBF kernel but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We employ Gaussian dynamics models parameterized by an MLP [48, 32], i.e., P̂(s, a) := N(h_θ(s, a), σ²I), where h_θ(s, a) = s + σ_Δs · MLP_θ(s_c, a_c), θ are the MLP's trainable parameters, s_c = (s − µ_s)/σ_s, and a_c = (a − µ_a)/σ_a, with µ_s, µ_a (and σ_s, σ_a) being the means (and standard deviations) of states and actions in the replay buffer D. For (s, a, s′) ∈ D, Δs = s′ − s, and σ_Δs is the standard deviation of the state differences Δs in D. We use SGD with momentum [60] to train the parameters θ of the MLP. Discriminator parameterization: we utilize MMD as our choice of IPM and define the discriminator as f(s) = w · ψ(s), where ψ(s) are Random Fourier Features [46]. Bonus parameterization: we utilize the discrepancy between the predictions of a pair of dynamics models h_θ1(s, a) and h_θ2(s, a) to design the bonus; empirically, using more than two models in the ensemble offered little to no improvement. Denote the disagreement at any (s, a) as δ(s, a) = ||h_θ1(s, a) − h_θ2(s, a)||₂, and let δ_D = max_{(s,a)∈D} δ(s, a) be the maximum discrepancy over a replay buffer D. We set the bonus to b(s, a) = λ · min(δ(s, a)/δ_D, 1), where λ > 0 is a tunable parameter. PG oracle: we use TRPO [54] to perform incremental policy optimization inside the learned model. We train an expert for each task using TRPO [54] until we obtain an expert policy of average value 460, 10, 38, 3000, and 2000, respectively. We set up Swimmer-v2, Hopper-v2, and Walker2d-v2 similarly to prior model-based RL works [33, 39, 38, 48, 32]. The learning curves are obtained by averaging all algorithms over 5 seeds. We utilize 10 expert trajectories for all environments except Swimmer-v2, because all algorithms (including MobILE) present results with high variance on that task. (Hedged code sketches of the dynamics model, discriminator, and bonus parameterizations follow this table.)
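For concreteness, here is a minimal sketch of the Algorithm 1 loop referenced in the Pseudocode row: alternate between collecting online samples, fitting a two-model dynamics ensemble, building a disagreement bonus, fitting a state-only MMD discriminator, and optimizing the policy inside the learned model with a policy gradient oracle such as TRPO. The component callables (collect_rollouts, fit_model_ensemble, make_bonus, fit_discriminator, pg_update) and the reward combination -f(s) + b(s, a) are hypothetical placeholders summarizing the quoted setup, not the repository's actual API.

```python
# Hedged sketch of the MobILE loop (Algorithm 1); the callables are hypothetical.
def mobile_loop(policy, expert_states, collect_rollouts, fit_model_ensemble,
                make_bonus, fit_discriminator, pg_update, n_outer_iters=100):
    replay_buffer = []
    for _ in range(n_outer_iters):
        # 1. Collect online samples with the current policy and grow the buffer.
        replay_buffer.extend(collect_rollouts(policy))
        # 2. Fit a pair of dynamics models (two suffice empirically, per the paper).
        model_1, model_2 = fit_model_ensemble(replay_buffer)
        # 3. Build the exploration bonus b(s, a) from ensemble disagreement.
        bonus = make_bonus(model_1, model_2, replay_buffer)
        # 4. Fit the state-only IPM (MMD) discriminator f(s): expert vs. policy states.
        f = fit_discriminator(expert_states, replay_buffer)
        # 5. Incremental policy optimization (e.g., TRPO) inside the learned model,
        #    against the bonus-augmented imitation reward -f(s) + b(s, a).
        policy = pg_update(policy, model_1, f, bonus)
    return policy
```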
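The Gaussian dynamics parameterization quoted in the Experiment Setup row can be sketched as follows, assuming a generic regressor `mlp` that maps normalized (s, a) inputs to a normalized state difference. The fixed noise scale and the small epsilon terms added for numerical stability are illustrative assumptions, and training the MLP with SGD plus momentum is omitted.

```python
import numpy as np

class GaussianDynamicsModel:
    """Sketch of P_hat(s, a) = N(h_theta(s, a), sigma^2 I) with
    h_theta(s, a) = s + sigma_ds * MLP_theta(s_c, a_c)."""

    def __init__(self, mlp, states, actions, next_states, noise_sigma=1e-2):
        self.mlp = mlp  # callable: normalized (s, a) -> normalized state difference
        # Normalization statistics from the replay buffer D.
        self.mu_s, self.sigma_s = states.mean(0), states.std(0) + 1e-8
        self.mu_a, self.sigma_a = actions.mean(0), actions.std(0) + 1e-8
        # Scale of the state differences Delta s = s' - s in D.
        self.sigma_ds = (next_states - states).std(0) + 1e-8
        self.noise_sigma = noise_sigma  # illustrative fixed Gaussian noise scale

    def mean(self, s, a):
        # h_theta(s, a): add a rescaled predicted state difference to s.
        s_c = (s - self.mu_s) / self.sigma_s
        a_c = (a - self.mu_a) / self.sigma_a
        return s + self.sigma_ds * self.mlp(np.concatenate([s_c, a_c], axis=-1))

    def sample(self, s, a, rng=np.random):
        # Draw s' from N(h_theta(s, a), sigma^2 I).
        return self.mean(s, a) + self.noise_sigma * rng.standard_normal(np.shape(s))
```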
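Likewise, a minimal sketch of the MMD discriminator f(s) = w · ψ(s) with Random Fourier Features of an RBF kernel: with MMD as the IPM, the witness w reduces to the difference between the expert and policy mean feature embeddings. The feature dimension, bandwidth, and seed below are illustrative choices, not the paper's settings.

```python
import numpy as np

class MMDDiscriminator:
    """State-only discriminator f(s) = w . psi(s) with Random Fourier Features."""

    def __init__(self, state_dim, n_features=512, bandwidth=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # RFF for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
        self.omega = rng.normal(scale=1.0 / bandwidth, size=(state_dim, n_features))
        self.phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.w = np.zeros(n_features)

    def features(self, states):
        # psi(s): random cosine features approximating the RBF kernel.
        return self.scale * np.cos(states @ self.omega + self.phase)

    def fit(self, expert_states, policy_states):
        # MMD witness direction: difference of mean feature embeddings.
        self.w = self.features(expert_states).mean(0) - self.features(policy_states).mean(0)

    def __call__(self, states):
        return self.features(states) @ self.w
```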
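Finally, a sketch of the disagreement bonus, assuming the two models expose the mean prediction h_θ(s, a) as in the dynamics sketch above. The cap at 1 in b(s, a) = λ · min(δ(s, a)/δ_D, 1) is an assumption based on the normalization by the maximum buffer disagreement δ_D, since that term is truncated in the extracted text.

```python
import numpy as np

def make_disagreement_bonus(model_1, model_2, buffer_states, buffer_actions, lam=1.0):
    """Bonus from the prediction gap of a two-model ensemble."""

    def disagreement(s, a):
        # delta(s, a) = ||h_theta1(s, a) - h_theta2(s, a)||_2
        return np.linalg.norm(model_1.mean(s, a) - model_2.mean(s, a), axis=-1)

    # delta_D: maximum disagreement over the replay buffer D.
    delta_max = disagreement(buffer_states, buffer_actions).max()

    def bonus(s, a):
        # b(s, a) = lambda * min(delta(s, a) / delta_D, 1)  (cap at 1 is an assumption)
        return lam * np.minimum(disagreement(s, a) / delta_max, 1.0)

    return bonus
```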