Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning

Authors: Francisco Garcia, Philip S. Thomas

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework." "6 Empirical Results: In this section we present experiments for discrete and continuous control tasks."
Researcher Affiliation | Academia | "Francisco M. Garcia and Philip S. Thomas, College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA, EMAIL"
Pseudocode | Yes | "Pseudocode for the implementations used in our framework using REINFORCE and PPO are shown in Appendix C."
Open Source Code | Yes | "Code used for this paper can be found at https://github.com/fmaxgarcia/Meta-MDP"
Open Datasets | Yes | "Implementations used for the discrete-case pole-balancing and all continuous control problems were taken from OpenAI Gym and the Roboschool benchmarks [2]. For the driving task experiments we used a simulator implemented in Unity by Tawn Kramer from the Donkey Car community. The Unity simulator for the self-driving task can be found at https://github.com/tawnkramer/sdsandbox"
Dataset Splits | No | The paper refers to 'training tasks' and 'testing tasks' but does not specify explicit training, validation, and test dataset splits with percentages or counts for any single dataset.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'Open AI Gym', 'Roboschool', and 'Unity' as software used but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "In our experiments we set the initial value of ε to 0.8, and decreased it by a factor of 0.995 every episode. Both policies, π and µ, were trained using REINFORCE: π for I = 1,000 episodes and µ for 500 iterations."
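
For concreteness, here is a minimal Python sketch of the training loop described in the Experiment Setup row: ε starts at 0.8 and is multiplied by 0.995 after each episode while the agent policy is trained with REINFORCE for 1,000 episodes. It uses Gym's CartPole-v1 as a stand-in for the paper's pole-balancing task; the linear softmax policy, discount factor, learning rate, and the uniform random draw standing in for the paper's learned advisor policy µ are illustrative assumptions, not details from the paper, and the Gym >= 0.26 reset/step API is assumed.

```python
import gym
import numpy as np

# Quantities quoted in the Experiment Setup row above.
EPS_INIT = 0.8      # initial exploration probability (epsilon)
EPS_DECAY = 0.995   # multiplicative decay applied after every episode
N_EPISODES = 1000   # REINFORCE training episodes for the agent policy
GAMMA = 0.99        # discount factor: an assumption, not from the paper
LR = 0.01           # learning rate: an assumption, not from the paper

env = gym.make("CartPole-v1")  # stand-in for the pole-balancing task
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
theta = np.zeros((obs_dim, n_actions))  # linear softmax policy parameters

def policy_probs(obs):
    """Action probabilities of a linear softmax policy."""
    logits = obs @ theta
    logits -= logits.max()  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

eps = EPS_INIT
for episode in range(N_EPISODES):
    obs, _ = env.reset()  # Gym >= 0.26 API
    states, actions, rewards = [], [], []
    done = False
    while not done:
        if np.random.rand() < eps:
            # Exploratory step. The paper samples this from a learned
            # advisor policy mu; a uniform draw is a placeholder here.
            action = env.action_space.sample()
        else:
            action = np.random.choice(n_actions, p=policy_probs(obs))
        states.append(obs)
        actions.append(action)
        obs, reward, terminated, truncated, _ = env.step(action)
        rewards.append(reward)
        done = terminated or truncated

    # REINFORCE update: ascend grad log pi(a|s) weighted by the return.
    G, returns = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + GAMMA * G
        returns[t] = G
    for s, a, g in zip(states, actions, returns):
        probs = policy_probs(s)
        grad_log = -np.outer(s, probs)  # d log pi / d theta, all actions
        grad_log[:, a] += s             # indicator term for the taken action
        theta += LR * g * grad_log

    eps *= EPS_DECAY  # the quoted 0.995-per-episode decay
```

In the paper's framework the exploratory action would instead come from the advisor policy µ, which is itself trained with REINFORCE for 500 iterations across tasks.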