A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning

Authors: Francisco Garcia, Philip S. Thomas

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework. Section 6 (Empirical Results): In this section we present experiments for discrete and continuous control tasks.
Researcher Affiliation | Academia | Francisco M. Garcia and Philip S. Thomas, College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA. {fmgarcia,pthomas}@cs.umass.edu
Pseudocode | Yes | Pseudocode for the implementations used in our framework using REINFORCE and PPO is shown in Appendix C. (A hedged REINFORCE-style sketch appears after the table.)
Open Source Code | Yes | Code used for this paper can be found at https://github.com/fmaxgarcia/Meta-MDP
Open Datasets | Yes | Implementations used for the discrete-case pole-balancing and all continuous control problems were taken from the OpenAI Gym and Roboschool benchmarks [2]. For the driving-task experiments we used a simulator implemented in Unity by Tawn Kramer from the Donkey Car community; the Unity simulator for the self-driving task can be found at https://github.com/tawnkramer/sdsandbox. (An environment-loading sketch appears after the table.)
Dataset Splits | No | The paper refers to 'training tasks' and 'testing tasks' but does not specify explicit training, validation, and test dataset splits with percentages or counts for any single dataset.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'OpenAI Gym', 'Roboschool', and 'Unity' as software used but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In our experiments we set the initial value of ε to 0.8, and decreased it by a factor of 0.995 every episode. Both policies, π and µ, were trained using REINFORCE: π for I = 1,000 episodes and µ for 500 iterations. (A minimal sketch of this schedule appears after the table.)
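
The quoted pseudocode (Appendix C) trains both the task policy and the advisor with policy-gradient updates. As a rough illustration only, here is a minimal REINFORCE-style update in Python; the function name, the absence of a baseline, and the exact meta-level return used by the authors are assumptions, not details taken from the paper.

    import torch

    def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
        # One REINFORCE update from a single episode. `optimizer` holds the
        # parameters of whichever policy is being updated (the task policy pi
        # within a task, or the advisor mu at the meta level). Hypothetical
        # helper, not the authors' implementation.
        returns, g = [], 0.0
        for r in reversed(rewards):          # discounted return-to-go
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Per the quoted setup, such an update would be applied to π for I = 1,000 episodes within each task and to µ for 500 meta-iterations across tasks; the PPO variant mentioned in the same row would replace this update with a clipped-surrogate objective.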
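The benchmark environments named in the Open Datasets row come from OpenAI Gym and Roboschool. A hedged loading example follows; the exact environment IDs and package versions the authors used are not stated in the excerpts above, so the IDs below are assumptions.

    import gym
    import roboschool  # importing registers the Roboschool environments with Gym

    cartpole = gym.make("CartPole-v1")        # discrete pole-balancing task
    hopper = gym.make("RoboschoolHopper-v1")  # one of the continuous control benchmarks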
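The Experiment Setup row describes an exploration schedule in which a mixing probability ε starts at 0.8 and is multiplied by 0.995 after every episode. Below is a minimal sketch of that schedule, assuming the agent follows the advisor µ with probability ε and its own policy π otherwise; `pi`, `mu`, and the per-episode update site are placeholders, not the authors' code.

    import random

    epsilon = 0.8   # initial mixing probability (from the quoted setup)
    decay = 0.995   # per-episode decay factor (from the quoted setup)

    def act(state, pi, mu, eps):
        # With probability eps take the advisor's exploratory action,
        # otherwise act with the task policy. (Assumed interpretation.)
        if random.random() < eps:
            return mu(state)
        return pi(state)

    # after each training episode:
    # epsilon *= decay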