Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates
Authors: Sarath Sreedharan, Michael Katz
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform our evaluation in four different domains. [...] Table 1 presents the comparison of our method against Q learning for the planning benchmarks. |
| Researcher Affiliation | Collaboration | Sarath Sreedharan, Department of Computer Science, Colorado State University, ssreedh3@colostate.edu; Michael Katz, IBM T.J. Watson Research Center, michael.katz1@ibm.com |
| Pseudocode | Yes | Algorithm 1 Iteratively refine the model until a goal reaching trace is found (a hedged sketch of this loop appears below the table) |
| Open Source Code | Yes | The code can be found at https://github.com/sarathsreedharan/Model Learner. |
| Open Datasets | Yes | For the RL domain, we looked at two variants of minigrid problem. One was the version introduced by [26] (henceforth referred to as Minigrid-Parl) and the other being a simplified version of the original minigrid testbed [8]. (Citation [8]: Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.) |
| Dataset Splits | No | The paper discusses evaluation using Q-learning episodes and sample counts but does not specify explicit training, validation, or test dataset splits or percentages. |
| Hardware Specification | Yes | All experiments were on a laptop running Mac OS v 11.06, with 2 GHz Quad-Core Intel Core i5 and 16 GB 3733 MHz LPDDR4X. We did not use CUDA in any of the experiments. |
| Software Dependencies | No | The paper mentions software like 'Simple RL framework' and 'FI-diverse-agl planner' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | For all planning based instances we set the time limit to 10 minutes, while for the minigrid instances we extended the time limit to 30 minutes. [...] For all the RL baselines we used a discount factor of γ. For Q learning and R max, we used a maximum of 1000000 episodes with 200 steps per episode. (A hedged Q-learning configuration sketch using this budget appears below the table.) |
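The Algorithm 1 caption quoted in the Pseudocode row suggests a plan-execute-refine loop. The sketch below is a minimal guess at the shape of such a loop based only on that caption; the `planner`, `executor`, and `refiner` callables and their signatures are assumptions for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of the loop suggested by the Algorithm 1 caption
# ("iteratively refine the model until a goal reaching trace is found").
# The planner/executor/refiner callables are hypothetical placeholders,
# not the paper's actual API.

def refine_until_goal_trace(model, planner, executor, refiner, max_rounds=100):
    """Plan with the current symbolic model estimate, execute the plan,
    and refine the model from the observed trace until a plan actually
    reaches the goal (or the round budget is exhausted)."""
    for _ in range(max_rounds):
        plan = planner(model)                 # plan on the current (optimistic) model estimate
        trace, reached_goal = executor(plan)  # run the plan against the real environment
        if reached_goal:
            return trace, model               # goal-reaching trace found
        model = refiner(model, trace)         # update the estimate from the failed trace
    return None, model                        # budget exhausted without reaching the goal
```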
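The Experiment Setup row reports a Q-learning baseline budget of up to 1,000,000 episodes with 200 steps per episode; the value of the discount factor γ is not given in the excerpt. The snippet below is a generic tabular Q-learning sketch using that budget, not the paper's implementation (which the Software Dependencies row indicates was built on the Simple RL framework); the `env` interface and the `gamma`, `alpha`, and `epsilon` defaults are assumptions.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, gamma, alpha=0.1, epsilon=0.1,
                       max_episodes=1_000_000, max_steps=200):
    """Generic tabular Q-learning with the budget reported in the paper
    (up to 1,000,000 episodes, 200 steps per episode). `env` is assumed to
    expose reset() -> state, step(action) -> (next_state, reward, done),
    and a discrete action list env.actions; the alpha and epsilon values
    are assumptions, as the excerpt does not state them."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return, default 0.0
    for _ in range(max_episodes):
        state = env.reset()
        for _ in range(max_steps):
            # epsilon-greedy action selection over the tabular estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # one-step TD update toward the greedy bootstrap target
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
            if done:
                break
    return Q
```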