Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates
Authors: Sarath Sreedharan, Michael Katz
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform our evaluation in four different domains. [...] Table 1 presents the comparison of our method against Q learning for the planning benchmarks. |
| Researcher Affiliation | Collaboration | Sarath Sreedharan, Department of Computer Science, Colorado State University; Michael Katz, IBM T.J. Watson Research Center |
| Pseudocode | Yes | Algorithm 1 Iteratively refine the model until a goal reaching trace is found |
| Open Source Code | Yes | The code can be found at https://github.com/sarathsreedharan/Model Learner. |
| Open Datasets | Yes | For the RL domain, we looked at two variants of minigrid problem. One was the version introduced by [26] (henceforth referred to as Minigrid-Parl) and the other being a simplified version of the original minigrid testbed [8]. (Citation [8]: Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.) |
| Dataset Splits | No | The paper discusses evaluation using Q-learning episodes and sample counts but does not specify explicit training, validation, or test dataset splits or percentages. |
| Hardware Specification | Yes | All experiments were on a laptop running Mac OS v 11.06, with 2 GHz Quad-Core Intel Core i5 and 16 GB 3733 MHz LPDDR4X. We did not use CUDA in any of the experiments. |
| Software Dependencies | No | The paper mentions software like 'Simple RL framework' and 'FI-diverse-agl planner' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | For all planning based instances we set the time limit to 10 minutes, while for the minigrid instances we extended the time limit to 30 minutes. [...] For all the RL baselines we used a discount factor of γ. For Q learning and R max, we used a maximum of 1000000 episodes with 200 steps per episode. |