Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs
Authors: Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer Basar
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments validate the advantages of Restart Q-UCB in terms of both cumulative rewards and computational efficiency. We conduct simulations showing that Restart Q-UCB achieves highly competitive cumulative rewards against a state-of-the-art solution (Zhou et al., 2020), while only taking 0.18% of its computation time. In this section, we empirically evaluate Restart Q-UCB on reinforcement learning tasks with various types of non-stationarity. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, IL, USA 2Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA, USA. |
| Pseudocode | Yes | Algorithm 1: Restart Q-UCB (Hoeffding) (a minimal sketch of this algorithm's structure appears after the table) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate the cumulative rewards of the four algorithms on a variant of a reinforcement learning task named Bidirectional Diabolical Combination Lock (Agarwal et al., 2020; Misra et al., 2020). |
| Dataset Splits | No | The paper mentions evaluating algorithms on a task and averaging results over runs, but it does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not list any specific software components with version numbers (e.g., programming languages, libraries, frameworks) used in the experiments. |
| Experiment Setup | No | A detailed discussion on the task settings as well as the configuration of the hyper-parameters is deferred to Appendix I. |
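
For readers assessing reproducibility, the following is a minimal sketch of the structure behind "Algorithm 1: Restart Q-UCB (Hoeffding)": tabular Q-learning with Hoeffding-style UCB exploration bonuses, re-initialized at the start of each epoch to handle non-stationary dynamics. The environment interface, the epoch schedule, and the constants `c` and `delta` are illustrative assumptions, not the paper's exact specification; the paper derives the number of restarts from the variation budget and defers hyper-parameter configuration to Appendix I.

```python
import numpy as np

def restart_q_ucb(env, S, A, H, K, num_epochs, c=1.0, delta=0.01):
    """Hedged sketch of restarted optimistic Q-learning.

    S, A: numbers of states and actions; H: episode horizon; K: total episodes.
    `env` is an assumed tabular interface: env.reset() -> state index,
    env.step(a) -> (next_state, reward, done). `num_epochs` is a free
    parameter here, whereas the paper ties the restart frequency to the
    variation budget of the non-stationary MDP.
    """
    iota = np.log(S * A * H * K / delta)   # log factor inside the bonus
    episodes_per_epoch = K // num_epochs
    total_reward = 0.0
    for _ in range(num_epochs):
        # Restart: discard estimates gathered under possibly drifted dynamics.
        Q = np.full((H, S, A), float(H))   # optimistic initialization at H
        N = np.zeros((H, S, A), dtype=int) # visit counts per (step, state, action)
        for _ in range(episodes_per_epoch):
            s = env.reset()
            for h in range(H):
                a = int(np.argmax(Q[h, s]))            # greedy w.r.t. optimistic Q
                s_next, r, _ = env.step(a)
                N[h, s, a] += 1
                t = N[h, s, a]
                alpha = (H + 1) / (H + t)              # learning rate of Jin et al. (2018)
                bonus = c * np.sqrt(H**3 * iota / t)   # Hoeffding-style exploration bonus
                # Clipped optimistic value of the next step.
                v_next = min(float(np.max(Q[h + 1, s_next])), H) if h + 1 < H else 0.0
                Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
                total_reward += r
                s = s_next
    return total_reward
```

The design choice mirrored here is that restarting forgets estimates collected under outdated dynamics, trading a transient loss of information for robustness to drift; the paper balances this trade-off by choosing the restart frequency as a function of the variation budget.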