OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning

Authors: Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, Doina Precup

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate OptionGAN in the context of continuous control locomotion tasks, considering both simulated MuJoCo locomotion OpenAI Gym environments (Brockman et al. 2016), modifications of these environments for task transfer (Henderson et al. 2017), and a more complex Roboschool task (Schulman et al. 2017). We show that the final policies learned using joint reward-policy options outperform a single reward approximator and policy network in most cases, and particularly excel at one-shot transfer learning. (...) Table 1 shows the overall results of our evaluations and we highlight a subset of learning curves in Figure 3. (...) Ablation Investigations
Researcher Affiliation | Academia | Peter Henderson (1), Wei-Di Chang (2), Pierre-Luc Bacon (1), David Meger (1), Joelle Pineau (1), Doina Precup (1); (1) School of Computer Science, McGill University, Montreal, Canada; (2) Department of Electrical, Computer, and Software Engineering, McGill University, Montreal, Canada
Pseudocode | Yes | Algorithm 1: IRLGAN (...) Algorithm 2: OptionGAN [a minimal sketch of the adversarial training loop these algorithms build on follows the table]
Open Source Code | Yes | Code is located at: https://github.com/Breakend/OptionGAN.
Open Datasets | Yes | We use the Hopper-v1, HalfCheetah-v1, and Walker2d-v1 locomotion environments (...) OpenAI Gym environments (Brockman et al. 2016) (...) MuJoCo simulator (Todorov, Erez, and Tassa 2012) (...) HopperSimpleWall-v0 environment provided by the gym-extensions framework (Henderson et al. 2017) and the RoboschoolHumanoidFlagrun-v1 environment used in (Schulman et al. 2017). [see the environment-setup sketch after the table]
Dataset Splits | No | The paper mentions collecting expert rollouts and sampling trajectories for training and evaluation, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU or GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using "Multilayer perceptrons", "TRPO", "PPO", the "MuJoCo simulator", and "OpenAI Gym environments", but it does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | All shared hyperparameters are held constant between IRLGAN and OptionGAN evaluation runs. All evaluations are averaged across 10 trials, each using a different random seed. (...) For simple settings all hidden layers are of size (64, 64) and for complex experiments are (128, 128). For the 2-options case we set λ_e = 10.0, λ_b = 10.0, λ_v = 1.0 based on a simple hyperparameter search and reported results from (Bengio et al. 2015). For the 4-options case we relax the regularizer that encourages a uniform distribution of options (L_b), setting λ_b = 0.01. [the reported values are collected into a configuration sketch after the table]
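
The pseudocode row points to Algorithm 1 (IRLGAN) and Algorithm 2 (OptionGAN), which the table does not reproduce. For orientation, below is a minimal, self-contained sketch of the adversarial inverse-RL loop that both algorithms build on: a discriminator is trained to separate expert states from novice states, and its log-output is then used as the surrogate reward for the policy update. Everything concrete here is a hypothetical simplification rather than the paper's implementation: the toy 1-D environment, the logistic discriminator, the finite-difference policy step standing in for TRPO, and helper names such as `rollout` and `surrogate`. OptionGAN additionally decomposes both the reward approximator and the policy into options, which this sketch omits.

```python
"""Minimal sketch of an IRLGAN-style adversarial IRL loop (toy setting, not the paper's code)."""
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy_w, n_steps=2048):
    """Collect states from a toy 1-D environment under a noisy tanh policy."""
    states, s = [], 0.0
    for _ in range(n_steps):
        a = np.tanh(policy_w * s) + 0.1 * rng.normal()
        s = 0.9 * s + a
        states.append(s)
    return np.array(states)

def discriminator(theta, states):
    """Logistic D(s): estimated probability that a state comes from the expert."""
    return 1.0 / (1.0 + np.exp(-(theta[0] * states + theta[1])))

# "Expert" demonstrations: states clustered near +1 (purely illustrative).
expert_states = 1.0 + 0.1 * rng.normal(size=2048)

policy_w, theta, lr = 0.0, np.zeros(2), 1e-2
for _ in range(200):
    novice_states = rollout(policy_w)

    # Discriminator step: ascend log D(expert) + log(1 - D(novice)).
    for batch, label in ((expert_states, 1.0), (novice_states, 0.0)):
        p = discriminator(theta, batch)
        g = label - p  # gradient of the log-likelihood w.r.t. the logit
        theta += lr * np.array([np.mean(g * batch), np.mean(g)])

    # Policy step: use log D(s) as the surrogate reward, with a crude
    # finite-difference update in place of TRPO (simplification).
    def surrogate(w):
        return np.mean(np.log(discriminator(theta, rollout(w)) + 1e-8))

    eps = 1e-2
    policy_w += lr * (surrogate(policy_w + eps) - surrogate(policy_w - eps)) / (2 * eps)

print("learned policy parameter:", policy_w)
```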
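
For the environments listed in the open-datasets row, a short setup sketch is given below, assuming the older OpenAI Gym API of that era (reset returning an observation, step returning a four-tuple). The specific gym and MuJoCo versions are assumptions, since the paper does not pin them; the gym-extensions and Roboschool environments additionally require importing their own packages so that their IDs get registered.

```python
# Environment-setup sketch, assuming an older gym release that still registers
# the -v1 MuJoCo tasks and that a licensed MuJoCo build is installed.
# HopperSimpleWall-v0 (gym-extensions) and RoboschoolHumanoidFlagrun-v1
# (Roboschool) need their packages imported first to register those IDs.
import gym

env_ids = ["Hopper-v1", "HalfCheetah-v1", "Walker2d-v1"]
envs = {env_id: gym.make(env_id) for env_id in env_ids}

env = envs["Hopper-v1"]
obs = env.reset()
for _ in range(5):
    # Step with random actions just to exercise the observation/action interfaces.
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
```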
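
Finally, the hyperparameters quoted in the experiment-setup row can be collected into a single configuration sketch. The key names below are illustrative only; the numeric values come from the quoted text, and the 4-option entry assumes the other regularizer weights stay at their 2-option values, which the excerpt does not state.

```python
# Hypothetical configuration gathering the reported hyperparameters.
# Key names are illustrative; only the numeric values come from the paper.
OPTIONGAN_EXPERIMENT_CONFIG = {
    "num_trials": 10,  # each trial uses a different random seed
    "hidden_sizes": {
        "simple_tasks": (64, 64),
        "complex_tasks": (128, 128),
    },
    # Regularizer weights for the 2-option case, chosen via a simple
    # hyperparameter search and results reported in Bengio et al. 2015.
    "two_options": {"lambda_e": 10.0, "lambda_b": 10.0, "lambda_v": 1.0},
    # For 4 options the uniform-option regularizer L_b is relaxed; the
    # remaining weights are assumed unchanged (not stated in the excerpt).
    "four_options": {"lambda_e": 10.0, "lambda_b": 0.01, "lambda_v": 1.0},
}
```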