Policy Improvement via Imitation of Multiple Oracles

Authors: Ching-An Cheng, Andrey Kolobov, Alekh Agarwal

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an evaluation against standard policy gradient with GAE and AggreVaTe(D), we showcase MAMBA’s ability to leverage demonstrations both from a single and from multiple weak oracles, and significantly speed up policy optimization. We corroborate our theoretical discoveries with simulations of IL from multiple oracles. We compare MAMBA with two representative algorithms: GAE Policy Gradient [17] (PG-GAE with λ = 0.9) for direct RL and AggreVaTeD [14] for IL with a single oracle. (A hedged GAE sketch follows after this table.)
Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA 98052, chinganc@microsoft.com; Andrey Kolobov, Microsoft Research, Redmond, WA 98052, akolobov@microsoft.com; Alekh Agarwal, Microsoft Research, Redmond, WA 98052, alekha@microsoft.com
Pseudocode | Yes | Algorithm 1: MAMBA for IL with multiple oracles. (A hedged sketch of one plausible ingredient, a multi-oracle baseline, follows after this table.)
Open Source Code | Yes | The code is provided at https://github.com/microsoft/MAMBA.
Open Datasets | Yes | Four continuous Gym [34] environments are used: CartPole and Double Inverted Pendulum (DIP), based on the DART physics engine [35], and Halfcheetah and Ant, based on the MuJoCo physics engine [36].
Dataset Splits | No | The paper uses continuous Gym environments in which data is generated through interaction, and therefore does not specify explicit training/validation/test dataset splits in the conventional sense.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software such as Gym, the DART physics engine, and the MuJoCo physics engine, as well as the ADAM and Natural Gradient Descent optimizers, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | In each training iteration, an algorithm performs H rollouts following the RIRO paradigm (see also Algorithm 1), where H = 8 for CartPole and DIP, and H = 256 for Halfcheetah and Ant. To facilitate a meaningful comparison, the three algorithms use the same first-order optimizer (ADAM [37] for CartPole, and Natural Gradient Descent [38] for DIP, Halfcheetah, and Ant), train the same initial neural network policies, and share the same random seeds. (A sketch of this per-iteration rollout loop follows after this table.)
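To make the PG-GAE baseline quoted under Research Type concrete, here is a minimal sketch of generalized advantage estimation with λ = 0.9. The discount factor, array shapes, and function name are assumptions; the excerpt above fixes only λ.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimation (GAE) over one rollout.

    rewards:    array of shape (T,), per-step rewards
    values:     array of shape (T,), value estimates V(s_t)
    last_value: bootstrap estimate V(s_T) for the state after the rollout
                (use 0.0 if the rollout ended in a true terminal state)
    gamma:      discount factor (assumed; not stated in the excerpt)
    lam:        GAE lambda, 0.9 as in the PG-GAE baseline
    """
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1},
    # where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The resulting advantages would feed a standard policy-gradient update; nothing here is specific to MAMBA.

The Pseudocode row cites "Algorithm 1: MAMBA for IL with multiple oracles", which the excerpt does not reproduce. The snippet below sketches only one plausible ingredient, a baseline built by taking a pointwise maximum over per-oracle value estimates; both the function name and the max-aggregation rule are assumptions rather than the paper's stated procedure.

```python
def max_aggregated_baseline(state, oracle_value_fns):
    """Combine K oracle value estimates into a single baseline via a pointwise max.

    oracle_value_fns: list of callables, each mapping a state to a scalar
                      value estimate for the corresponding oracle policy.
    (Assumed aggregation rule; the excerpt above does not specify it.)
    """
    return max(v(state) for v in oracle_value_fns)
```

The Experiment Setup row specifies H rollouts per training iteration with a shared optimizer and shared random seeds across algorithms. The skeleton below mirrors that outer loop on a Gym environment; the environment IDs, the policy and update callables, and the seeding calls are placeholders, since the excerpt does not name them (the Gym API shown is the pre-0.26 one).

```python
import gym
import numpy as np

# Placeholder environment IDs; the exact Gym/DART IDs are not given in the excerpt.
# H = 8 for CartPole and DIP, H = 256 for Halfcheetah and Ant.
ROLLOUTS_PER_ITER = {"HalfCheetah-v2": 256, "Ant-v2": 256}

def collect_rollout(env, policy):
    """Roll out the current policy once and return (states, actions, rewards)."""
    states, actions, rewards = [], [], []
    obs, done = env.reset(), False
    while not done:
        act = policy(obs)
        next_obs, rew, done, _ = env.step(act)
        states.append(obs); actions.append(act); rewards.append(rew)
        obs = next_obs
    return states, actions, rewards

def train(env_id, policy, update_fn, n_iters, seed=0):
    """Outer loop: H rollouts per iteration, with a seed shared across algorithms."""
    env = gym.make(env_id)
    env.seed(seed)          # pre-0.26 Gym API; newer versions pass seed to reset()
    np.random.seed(seed)
    H = ROLLOUTS_PER_ITER.get(env_id, 8)
    for _ in range(n_iters):
        batch = [collect_rollout(env, policy) for _ in range(H)]
        update_fn(policy, batch)   # e.g. a PG-GAE, AggreVaTeD, or MAMBA update
    return policy
```

Running the same loop with identical seeds and initial policies for each of the three compared update rules reproduces the "same optimizer, same initialization, same seeds" protocol described above.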