A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning
Authors: Francisco Garcia, Philip S. Thomas
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework. [Section 6, Empirical Results] In this section we present experiments for discrete and continuous control tasks. |
| Researcher Affiliation | Academia | Francisco M. Garcia and Philip S. Thomas College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA, USA {fmgarcia,pthomas}@cs.umass.edu |
| Pseudocode | Yes | Pseudocode for the implementations used in our framework using REINFORCE and PPO is shown in Appendix C. |
| Open Source Code | Yes | Code used for this paper can be found at https://github.com/fmaxgarcia/Meta-MDP |
| Open Datasets | Yes | Implementations used for the discrete-case pole-balancing task and all continuous control problems were taken from the OpenAI Gym and Roboschool benchmarks [2]. For the driving-task experiments we used a simulator implemented in Unity by Tawn Kramer from the Donkey Car community (footnote: the Unity simulator for the self-driving task can be found at https://github.com/tawnkramer/sdsandbox). (An environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper refers to 'training tasks' and 'testing tasks' but does not specify explicit training, validation, and test dataset splits with percentages or counts for any single dataset. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Open AI Gym', 'Roboschool', and 'Unity' as software used but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In our experiments we set the initial value of ε to 0.8, and decreased it by a factor of 0.995 every episode. Both policies, π and µ, were trained using REINFORCE: π for I = 1,000 episodes and µ for 500 iterations. (A minimal sketch of this schedule follows the table.) |
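
The benchmark sources quoted in the Open Datasets row are standard OpenAI Gym and Roboschool environments. As a point of reference only, the snippet below shows how such an environment is typically instantiated and rolled out; it assumes the classic (pre-0.26) Gym `reset`/`step` API and is not taken from the authors' repository, whose exact dependency versions are not reported in the paper.

```python
# Illustration only (not from the paper): instantiating and rolling out one
# of the referenced benchmark environments with a random policy.
import gym               # classic OpenAI Gym API (pre-0.26 reset/step)
import roboschool        # noqa: F401 -- importing registers Roboschool* env IDs

env = gym.make("CartPole-v1")            # discrete pole-balancing task
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
env.close()
```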
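
To make the quoted experiment-setup numbers concrete, the sketch below implements the two-level REINFORCE schedule they describe: ε starts at 0.8 and is multiplied by 0.995 each episode, the task policy π is trained for I = 1,000 episodes per task, and the advisor µ for 500 iterations. The chain MDP, the tabular softmax policies, and the way the advisor's update aggregates an entire task lifetime into one meta-episode are illustrative assumptions, not the authors' implementation; their actual REINFORCE and PPO variants are given in Appendix C and the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal, self-contained sketch (NOT the authors' code) of the quoted
# schedule: epsilon = 0.8 * 0.995**episode, task policy pi trained with
# REINFORCE for 1,000 episodes per task, advisor mu for 500 iterations.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 20
GAMMA, LR = 0.99, 0.05
EPS_INIT, EPS_DECAY = 0.8, 0.995
TASK_EPISODES, ADVISOR_ITERS = 1_000, 500   # values quoted from the paper


class SoftmaxPolicy:
    """Tabular softmax policy with a vanilla REINFORCE update (toy stand-in)."""

    def __init__(self):
        self.theta = np.zeros((N_STATES, N_ACTIONS))

    def probs(self, s):
        z = np.exp(self.theta[s] - self.theta[s].max())
        return z / z.sum()

    def act(self, s):
        return rng.choice(N_ACTIONS, p=self.probs(s))

    def reinforce(self, trajectory):
        """One REINFORCE update from a list of (state, action, reward)."""
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + GAMMA * g                   # return-to-go
            grad = -self.probs(s)
            grad[a] += 1.0                      # d log pi(a|s) / d theta[s]
            self.theta[s] += LR * g * grad


def chain_step(s, a):
    """Toy chain MDP: action 1 moves right; reward 1 at the final state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N_STATES - 1)


def run_episode(pi, mu, eps):
    """Roll out pi, substituting the advisor mu with probability eps."""
    s, traj = 0, []
    for _ in range(HORIZON):
        actor = mu if rng.random() < eps else pi
        a = actor.act(s)
        s_next, r = chain_step(s, a)
        traj.append((s, a, r))
        s = s_next
    return traj


mu = SoftmaxPolicy()                            # exploration advisor
for _ in range(ADVISOR_ITERS):
    pi = SoftmaxPolicy()                        # fresh task policy each iteration
    advisor_traj = []
    for episode in range(TASK_EPISODES):
        eps = EPS_INIT * (EPS_DECAY ** episode)
        traj = run_episode(pi, mu, eps)
        pi.reinforce(traj)                      # inner loop: task-policy update
        advisor_traj.extend(traj)
    # Simplifying assumption: the advisor is updated on the whole task
    # lifetime treated as one meta-episode of the meta-MDP.
    mu.reinforce(advisor_traj)
```

With the quoted loop sizes the toy run takes a while in pure Python; scaling TASK_EPISODES and ADVISOR_ITERS down gives a quick smoke test without changing the structure of the schedule.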