Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Model-Free Active Exploration in Reinforcement Learning
Authors: Alessio Russo, Alexandre Proutiere
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches. ... Numerical results on hard-exploration problems highlighted the effectiveness of our approach for learning efficient policies over state-of-the-art methods. |
| Researcher Affiliation | Academia | Alessio Russo Division of Decision and Control Systems KTH Royal Institute of Technology Stockholm, SE Alexandre Proutiere Division of Decision and Control Systems KTH Royal Institute of Technology Stockholm, SE |
| Pseudocode | Yes | Algorithm 1 Bootstrapped MF-BPI (Bootstrapped Model Free Best Policy Identification) ... Algorithm 2 DBMF-BPI (Deep Bootstrapped Model Free BPI) |
| Open Source Code | Yes | Code repository: https://github.com/rssalessio/ModelFreeActiveExplorationRL |
| Open Datasets | Yes | For the tabular MDPs, we test the performance of MF-BPI on the Riverswim and the Forked Riverswim environments... For continuous state-spaces, we compare our algorithm to IDS [33] and BSP [39]... and assess their performance on hard-exploration problems from the DeepMind BSuite [41] (the Deep Sea and the Cartpole swingup problems). |
| Dataset Splits | No | The paper does not explicitly provide training, validation, and test dataset splits with specific percentages or sample counts. It refers to standard environments and evaluation after T steps or episodes, which are typical for RL, but not in the conventional supervised learning dataset split sense. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., CPU, GPU models, or memory specifications). |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software or libraries used in the experiments. |
| Experiment Setup | Yes | Require: Parameters (λ, k, p); ensemble size B; learning rates {(αt, βt)}t. ... Details of the experiments, including the initialization of the algorithms, are provided in Appendix A. |