Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates

Authors: Sarath Sreedharan, Michael Katz

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform our evaluation in four different domains. [...] Table 1 presents the comparison of our method against Q learning for the planning benchmarks.
Researcher Affiliation | Collaboration | Sarath Sreedharan, Department of Computer Science, Colorado State University, ssreedh3@colostate.edu; Michael Katz, IBM T.J. Watson Research Center, michael.katz1@ibm.com
Pseudocode | Yes | Algorithm 1: Iteratively refine the model until a goal reaching trace is found. (A minimal sketch of this refine-and-replan loop appears after the table.)
Open Source Code | Yes | The code can be found at https://github.com/sarathsreedharan/Model Learner.
Open Datasets | Yes | For the RL domain, we looked at two variants of the minigrid problem. One was the version introduced by [26] (henceforth referred to as Minigrid-Parl) and the other being a simplified version of the original minigrid testbed [8]. (Citation [8]: Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.) (See the usage sketch after the table.)
Dataset Splits | No | The paper discusses evaluation using Q-learning episodes and sample counts but does not specify explicit training, validation, or test dataset splits or percentages.
Hardware Specification | Yes | All experiments were on a laptop running Mac OS v 11.06, with 2 GHz Quad-Core Intel Core i5 and 16 GB 3733 MHz LPDDR4X. We did not use CUDA in any of the experiments.
Software Dependencies | No | The paper mentions software such as the 'Simple RL framework' and the 'FI-diverse-agl planner' but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | For all planning based instances we set the time limit to 10 minutes, while for the minigrid instances we extended the time limit to 30 minutes. [...] For all the RL baselines we used a discount factor of γ. For Q learning and R max, we used a maximum of 1000000 episodes with 200 steps per episode. (A hypothetical Q-learning configuration matching this budget is sketched below.)
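
As noted in the Pseudocode row, the paper's Algorithm 1 iteratively refines a symbolic model estimate until a plan derived from it produces a goal-reaching trace in the underlying environment. The following is a minimal, self-contained sketch of that refine-and-replan pattern, not the authors' implementation: the grid world, the wall-based failure model, and every name and constant below are illustrative assumptions.

```python
# Minimal sketch of an optimistic "plan, execute, refine" loop in the spirit of
# the paper's Algorithm 1. This is NOT the authors' code: the grid world, the
# wall-based failure model, and all names below are illustrative assumptions.
from collections import deque

GRID = 5
WALLS = {(2, 1), (2, 2), (2, 3)}     # the *true* environment: cells the agent cannot enter
START, GOAL = (0, 0), (4, 4)
MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]


def plan(blocked):
    """BFS over the estimated model: optimistically treat every in-bounds cell
    not yet known to be blocked as traversable."""
    parents = {START: None}
    frontier = deque([START])
    while frontier:
        cell = frontier.popleft()
        if cell == GOAL:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        for dx, dy in MOVES:
            nxt = (cell[0] + dx, cell[1] + dy)
            if (0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID
                    and nxt not in blocked and nxt not in parents):
                parents[nxt] = cell
                frontier.append(nxt)
    return None


def execute(path):
    """Run the plan in the 'real' environment and report the first failing cell."""
    for cell in path[1:]:
        if cell in WALLS:
            return False, cell           # execution diverges from the model here
    return True, None


def refine_until_goal():
    blocked = set()                      # optimistic initial model: nothing is blocked
    while True:
        path = plan(blocked)
        if path is None:
            return None                  # the refined model admits no plan
        ok, failure = execute(path)
        if ok:
            return path                  # goal-reaching trace found
        blocked.add(failure)             # refine: record the observed failure


if __name__ == "__main__":
    print(refine_until_goal())
```

The design point mirrored here is the optimistic initialization: the estimated model starts out assuming every transition is possible and is only constrained where execution contradicts it.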
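
The simplified minigrid testbed cited as [8] in the Open Datasets row is the gym-minigrid package. As a hedged usage note, a registered MiniGrid environment can be instantiated as follows; the environment id, wrappers, and gym version shown here are assumptions, since the paper does not state which were used.

```python
# Hypothetical instantiation of a gym-minigrid environment (citation [8]).
# The environment id and the pre-0.26 gym reset/step API used here are
# assumptions; the paper does not state which environment or library versions
# were used.
import gym
import gym_minigrid  # noqa: F401  (importing registers the MiniGrid-* ids with gym)

env = gym.make("MiniGrid-Empty-8x8-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```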
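
Finally, the Experiment Setup row quotes a budget of 1,000,000 episodes with 200 steps per episode for the Q-learning and R-max baselines, with the discount factor elided in the extracted text. A hypothetical tabular Q-learning configuration using that budget might look like the sketch below; the learning rate, exploration schedule, and discount value are assumptions, not the paper's settings.

```python
# Hypothetical tabular Q-learning loop using the episode/step budget quoted above
# (1,000,000 episodes, 200 steps per episode). The discount factor, learning rate,
# and epsilon-greedy schedule are assumptions; the excerpt does not give them.
import random
from collections import defaultdict


def q_learning(env, gamma=0.99, alpha=0.1, epsilon=0.1,
               max_episodes=1_000_000, steps_per_episode=200):
    """Tabular epsilon-greedy Q-learning; assumes hashable states."""
    q = defaultdict(float)                     # Q[(state, action)] -> estimated return
    actions = list(range(env.action_space.n))
    for _ in range(max_episodes):
        state = env.reset()
        for _ in range(steps_per_episode):
            if random.random() < epsilon:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])   # exploit
            next_state, reward, done, _ = env.step(action)
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = next_state
            if done:
                break
    return q
```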