Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
Authors: Jiafan He, Dongruo Zhou, Quanquan Gu
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a model-based algorithm named UCBVI-γ, which is based on the optimism in the face of uncertainty principle and the Bernstein-type bonus. We show that UCBVI-γ achieves an e O SAT/(1 γ)1.5 regret... In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least eΩ SAT/(1 γ)1.5 . Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-γ is nearly minimax optimal for discounted MDPs. |
| Researcher Affiliation | Academia | Jiafan He Department of Computer Science University of California, Los Angeles CA 90095, USA jiafanhe19@ucla.edu Dongruo Zhou Department of Computer Science University of California, Los Angeles CA 90095, USA drzhou@cs.ucla.edu Quanquan Gu Department of Computer Science University of California, Los Angeles CA 90095, USA qgu@cs.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Upper Confidence Value-iteration UCBVI-γ |
| Open Source Code | No | The paper states "[N/A]" for including code/data/instructions needed to reproduce main experimental results and does not provide any other information about source code availability. |
| Open Datasets | No | The paper states "[N/A]" for questions related to data and experiments, indicating that no datasets were used in an experimental setting. |
| Dataset Splits | No | The paper states "[N/A]" for questions related to data and experiments, indicating that no dataset splits for training/validation were used. |
| Hardware Specification | No | The paper states "[N/A]" for questions related to compute resources, and no hardware specifications are mentioned as it is a theoretical work. |
| Software Dependencies | No | The paper states "[N/A]" for questions related to training details and compute resources, and does not mention any specific software dependencies or versions. |
| Experiment Setup | No | The paper focuses on theoretical analysis and proofs, and explicitly states "[N/A]" for questions related to training details and experimental setup. |