Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs

Authors: Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer Başar

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Numerical experiments validate the advantages of Restart Q-UCB in terms of both cumulative rewards and computational efficiency. We conduct simulations showing that Restart Q-UCB achieves highly competitive cumulative rewards against a state-of-the-art solution (Zhou et al., 2020), while only taking 0.18% of its computation time." "In this section, we empirically evaluate Restart Q-UCB on reinforcement learning tasks with various types of non-stationarity."
Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, IL, USA; (2) Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA, USA.
Pseudocode | Yes | Algorithm 1: Restart Q-UCB (Hoeffding); a hedged sketch of the restart mechanism follows this table.
Open Source Code | No | The paper does not contain any explicit statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | "We evaluate the cumulative rewards of the four algorithms on a variant of a reinforcement learning task named Bidirectional Diabolical Combination Lock (Agarwal et al., 2020; Misra et al., 2020)."
Dataset Splits | No | The paper mentions evaluating algorithms on a task and averaging results over runs, but it does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not list any specific software components with version numbers (e.g., programming languages, libraries, frameworks) used in the experiments.
Experiment Setup | No | "A detailed discussion on the task settings as well as the configuration of the hyper-parameters is deferred to Appendix I."
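To make the pseudocode entry concrete, below is a minimal Python sketch of the idea behind Restart Q-UCB (Hoeffding): optimistic tabular Q-learning with a Hoeffding-style exploration bonus in the style of Jin et al. (2018), wrapped in periodic restarts so that stale estimates from a drifting MDP are discarded. The environment interface (env.reset / env.step), the epoch length D, and the constants c and delta are illustrative assumptions, not the paper's exact implementation or hyper-parameters (which the paper defers to Appendix I).

import math
import numpy as np

def restart_q_ucb(env, S, A, H, K, D, c=1.0, delta=0.01):
    """Run K episodes of horizon H; restart all estimates every D episodes.

    S, A: number of states and actions; D: restart epoch length (assumed input).
    """
    iota = math.log(S * A * K * H / delta)  # logarithmic factor in the bonus
    total_reward = 0.0
    for k in range(K):
        if k % D == 0:
            # Restart: forget the past so old, non-stationary data cannot bias us.
            Q = np.full((H, S, A), float(H))  # optimistic initialization
            V = np.zeros((H + 1, S))
            V[:H] = H
            N = np.zeros((H, S, A), dtype=int)  # per-(step, state, action) visit counts
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))  # act greedily w.r.t. the optimistic Q
            s_next, r = env.step(a)      # assumed interface: returns (next state, reward)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                # learning rate of Jin et al. (2018)
            bonus = c * math.sqrt(H**3 * iota / t)   # Hoeffding-style exploration bonus
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(H, Q[h, s].max())          # value truncated at the max return
            total_reward += r
            s = s_next
    return total_reward

The restart is the only non-stationarity-specific ingredient: inside each epoch the update is standard optimistic Q-learning, which is what keeps the method model-free and cheap relative to model-based baselines such as Zhou et al. (2020).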