Policy Improvement via Imitation of Multiple Oracles

Authors: Ching-An Cheng, Andrey Kolobov, Alekh Agarwal

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an evaluation against standard policy gradient with GAE and AggreVaTe(D), we showcase MAMBA’s ability to leverage demonstrations both from a single and from multiple weak oracles, and significantly speed up policy optimization. We corroborate our theoretical discoveries with simulations of IL from multiple oracles. We compare MAMBA with two representative algorithms: GAE Policy Gradient [17] (PG-GAE with λ = 0.9) for direct RL and AggreVaTeD [14] for IL with a single oracle. (A hedged GAE sketch follows after this table.)
Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA 98052, chinganc@microsoft.com; Andrey Kolobov, Microsoft Research, Redmond, WA 98052, akolobov@microsoft.com; Alekh Agarwal, Microsoft Research, Redmond, WA 98052, alekha@microsoft.com
Pseudocode | Yes | Algorithm 1: MAMBA for IL with multiple oracles. (A hedged sketch of one plausible ingredient, a multi-oracle baseline, follows after this table.)
Open Source Code | Yes | The code is provided at https://github.com/microsoft/MAMBA.
Open Datasets | Yes | Four continuous Gym [34] environments are used: CartPole and Double Inverted Pendulum (DIP), based on the DART physics engine [35], and Halfcheetah and Ant, based on the MuJoCo physics engine [36].
Dataset Splits | No | The paper uses continuous Gym environments in which data is generated through interaction, and therefore does not specify explicit training/validation/test dataset splits in the conventional sense.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software such as Gym, the DART physics engine, and the MuJoCo physics engine, as well as the ADAM and Natural Gradient Descent optimizers, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | In each training iteration, an algorithm performs H rollouts following the RIRO paradigm (see also Algorithm 1), where H = 8 for CartPole and DIP, and H = 256 for Halfcheetah and Ant. To facilitate a meaningful comparison, the three algorithms use the same first-order optimizer (ADAM [37] for CartPole, and Natural Gradient Descent [38] for DIP, Halfcheetah, and Ant), train the same initial neural network policies, and share the same random seeds. (A sketch of this per-iteration rollout loop follows after this table.)
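To make the PG-GAE baseline quoted under Research Type concrete, here is a minimal sketch of generalized advantage estimation with λ = 0.9. The discount factor, array shapes, and function name are assumptions; the excerpt above fixes only λ.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimation (GAE) over one rollout.

    rewards:    array of shape (T,), per-step rewards
    values:     array of shape (T,), value estimates V(s_t)
    last_value: bootstrap estimate V(s_T) for the state after the rollout
                (use 0.0 if the rollout ended in a true terminal state)
    gamma:      discount factor (assumed; not stated in the excerpt)
    lam:        GAE lambda, 0.9 as in the PG-GAE baseline
    """
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1},
    # where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The resulting advantages would feed a standard policy-gradient update; nothing here is specific to MAMBA.

The Pseudocode row cites "Algorithm 1: MAMBA for IL with multiple oracles", which the excerpt does not reproduce. The snippet below sketches only one plausible ingredient, a baseline built by taking a pointwise maximum over per-oracle value estimates; both the function name and the max-aggregation rule are assumptions rather than the paper's stated procedure.

```python
def max_aggregated_baseline(state, oracle_value_fns):
    """Combine K oracle value estimates into a single baseline via a pointwise max.

    oracle_value_fns: list of callables, each mapping a state to a scalar
                      value estimate for the corresponding oracle policy.
    (Assumed aggregation rule; the excerpt above does not specify it.)
    """
    return max(v(state) for v in oracle_value_fns)
```

The Experiment Setup row specifies H rollouts per training iteration with a shared optimizer and shared random seeds across algorithms. The skeleton below mirrors that outer loop on a Gym environment; the environment IDs, the policy and update callables, and the seeding calls are placeholders, since the excerpt does not name them (the Gym API shown is the pre-0.26 one).

```python
import gym
import numpy as np

# Placeholder environment IDs; the exact Gym/DART IDs are not given in the excerpt.
# H = 8 for CartPole and DIP, H = 256 for Halfcheetah and Ant.
ROLLOUTS_PER_ITER = {"HalfCheetah-v2": 256, "Ant-v2": 256}

def collect_rollout(env, policy):
    """Roll out the current policy once and return (states, actions, rewards)."""
    states, actions, rewards = [], [], []
    obs, done = env.reset(), False
    while not done:
        act = policy(obs)
        next_obs, rew, done, _ = env.step(act)
        states.append(obs); actions.append(act); rewards.append(rew)
        obs = next_obs
    return states, actions, rewards

def train(env_id, policy, update_fn, n_iters, seed=0):
    """Outer loop: H rollouts per iteration, with a seed shared across algorithms."""
    env = gym.make(env_id)
    env.seed(seed)          # pre-0.26 Gym API; newer versions pass seed to reset()
    np.random.seed(seed)
    H = ROLLOUTS_PER_ITER.get(env_id, 8)
    for _ in range(n_iters):
        batch = [collect_rollout(env, policy) for _ in range(H)]
        update_fn(policy, batch)   # e.g. a PG-GAE, AggreVaTeD, or MAMBA update
    return policy
```

Running the same loop with identical seeds and initial policies for each of the three compared update rules reproduces the "same optimizer, same initialization, same seeds" protocol described above.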