Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs
Authors: Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer Basar
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments validate the advantages of Restart Q-UCB in terms of both cumulative rewards and computational efficiency. We conduct simulations showing that Restart Q-UCB achieves highly competitive cumulative rewards against a state-of-the-art solution (Zhou et al., 2020), while only taking 0.18% of its computation time; In this section, we empirically evaluate Restart Q-UCB on reinforcement learning tasks with various types of non-stationarity. |
| Researcher Affiliation | Academia | ¹Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, IL, USA; ²Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA, USA. |
| Pseudocode | Yes | Algorithm 1: Restart Q-UCB (Hoeffding) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate the cumulative rewards of the four algorithms on a variant of a reinforcement learning task named Bidirectional Diabolical Combination Lock (Agarwal et al., 2020; Misra et al., 2020). |
| Dataset Splits | No | The paper mentions evaluating algorithms on a task and averaging results over runs, but it does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not list any specific software components with version numbers (e.g., programming languages, libraries, frameworks) used in the experiments. |
| Experiment Setup | No | A detailed discussion on the task settings as well as the configuration of the hyper-parameters is deferred to Appendix I. |
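
Because the table records that no source code was released, the following is a minimal, hypothetical Python sketch of the general idea that the name "Restart Q-UCB (Hoeffding)" suggests: optimistic tabular Q-learning with a Hoeffding-style exploration bonus whose estimates are periodically reset so that stale data from a changed environment is forgotten. It is not the authors' Algorithm 1. The function names (`restart_q_ucb`, `toy_env_step`), the bonus constant `c`, the fixed `epoch_len`, the learning-rate schedule, and the toy environment are all assumptions made for illustration; the paper and its Appendix I should be consulted for the actual algorithm and hyper-parameters.

```python
import math
import random


def restart_q_ucb(env_step, n_states, n_actions, horizon, n_episodes,
                  epoch_len, c=1.0, seed=0):
    """Illustrative sketch: optimistic Q-learning with periodic restarts.

    All parameter choices here are assumptions, not the paper's settings.
    Returns the per-episode rewards and the number of restarts performed.
    """
    rng = random.Random(seed)
    # Log factor used inside the Hoeffding-style bonus (assumed form).
    iota = math.log(max(2, n_states * n_actions * n_episodes * horizon))
    returns, restarts = [], 0

    for k in range(n_episodes):
        if k % epoch_len == 0:
            # Restart: forget everything learned in the previous epoch.
            Q = [[[float(horizon)] * n_actions for _ in range(n_states)]
                 for _ in range(horizon)]
            N = [[[0] * n_actions for _ in range(n_states)]
                 for _ in range(horizon)]
            restarts += 1

        s, total = 0, 0.0
        for h in range(horizon):
            # Act greedily with respect to the optimistic Q estimates.
            a = max(range(n_actions), key=lambda x: Q[h][s][x])
            r, s_next = env_step(k, h, s, a, rng)

            N[h][s][a] += 1
            t = N[h][s][a]
            alpha = (horizon + 1) / (horizon + t)            # step size (assumed)
            bonus = c * math.sqrt(horizon ** 3 * iota / t)   # exploration bonus (assumed)
            v_next = 0.0 if h == horizon - 1 else min(horizon, max(Q[h + 1][s_next]))
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus)

            total += r
            s = s_next
        returns.append(total)
    return returns, restarts


def toy_env_step(k, h, s, a, rng):
    """Hypothetical non-stationary task: the rewarding action flips at episode 100."""
    good_action = 0 if k < 100 else 1
    reward = 1.0 if a == good_action else 0.0
    return reward, rng.randrange(2)


# Example usage on the toy task: 200 episodes, restarting every 50 episodes.
episode_returns, n_restarts = restart_q_ucb(
    toy_env_step, n_states=2, n_actions=2, horizon=3,
    n_episodes=200, epoch_len=50)
```

A fixed restart interval is the simplest possible schedule and is used here only to keep the sketch short; how the epoch length should depend on the amount of non-stationarity is exactly the kind of detail that must be taken from the paper itself.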