Provably Efficient Exploration in Policy Optimization
Authors: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps (the regret notion is sketched below the table). To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores. |
| Researcher Affiliation | Academia | Department of Industrial Engineering and Management Sciences, Northwestern University; Department of Operations Research and Financial Engineering, Princeton University; Department of Electrical Engineering, Princeton University. |
| Pseudocode | Yes | Algorithm 1 Optimistic PPO (OPPO); a hedged code sketch of the algorithm's two phases follows this table. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. Footnote 1 refers to the arXiv version of the paper itself, not an external code repository. |
| Open Datasets | No | The paper studies a theoretical setting (episodic Markov decision processes with linear function approximation) and, being purely theoretical, does not use any public or open dataset. |
| Dataset Splits | No | As a theoretical paper without empirical studies, it does not specify dataset splits for training, validation, or testing. |
| Hardware Specification | No | As a theoretical paper, it does not provide details about hardware specifications used for running experiments. |
| Software Dependencies | No | As a theoretical paper, it does not provide details about specific ancillary software or library versions needed to replicate experiments. |
| Experiment Setup | No | As a theoretical paper, it does not provide specific experimental setup details such as hyperparameter values or training configurations. |
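
For context on the regret bound quoted above, the following is a minimal sketch of the regret notion used in this line of work, assuming a fixed initial state $x_1$, $K$ episodes, and $T = HK$ total steps; the paper's exact notation may differ.

```latex
% Regret against the best fixed policy in hindsight, under
% adversarial rewards with full-information feedback.
% Assumptions: K episodes, horizon H, total steps T = HK, and
% V_1^{pi,k} is the value of policy pi under episode k's rewards.
\mathrm{Regret}(T)
  = \max_{\pi} \sum_{k=1}^{K}
    \Bigl( V_1^{\pi, k}(x_1) - V_1^{\pi^k, k}(x_1) \Bigr),
\qquad
\mathrm{Regret}(T) \le \tilde{O}\!\left(\sqrt{d^2 H^3 T}\right).
```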
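Since the paper's Algorithm 1 is given only as pseudocode, the sketch below illustrates its two alternating phases in Python for a small, finite linear MDP: a KL-regularized (mirror-descent) policy improvement step and an optimistic least-squares policy evaluation step with a UCB-style bonus. The helpers `phi`, `reward`, and `sample_episode`, and all default constants, are hypothetical stand-ins rather than the paper's API; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def oppo_sketch(phi, reward, sample_episode, K, alpha=0.1, beta=1.0, lam=1.0):
    """Minimal sketch of Optimistic PPO (OPPO) on a finite linear MDP.

    Hypothetical inputs (not from the paper's text):
      phi            -- (S, A, d) array, feature map phi(x, a)
      reward         -- (H, S, A) array, full-information episode rewards
      sample_episode -- callable(pi) -> [(x_1, a_1), ..., (x_H, a_H)]
      K              -- number of episodes
    """
    H, S, A = reward.shape
    d = phi.shape[-1]
    log_pi = np.zeros((H, S, A))          # log of a uniform initial policy
    Q = np.zeros((H, S, A))
    data = []                             # trajectories from past episodes

    for _ in range(K):
        # Phase 1 -- policy improvement (mirror descent / KL regularization):
        # pi_k(a|x) proportional to pi_{k-1}(a|x) * exp(alpha * Q_{k-1}(x, a)).
        log_pi = log_pi + alpha * Q
        log_pi -= log_pi.max(axis=-1, keepdims=True)   # numerical stability
        pi = np.exp(log_pi)
        pi /= pi.sum(axis=-1, keepdims=True)

        data.append(sample_episode(pi))

        # Phase 2 -- optimistic policy evaluation: regularized least squares
        # for P_h V_{h+1}, plus a UCB bonus beta * ||phi(x,a)||_{Lambda_h^{-1}}.
        V = np.zeros(S)                   # value beyond the horizon is zero
        for h in reversed(range(H)):
            Lam = lam * np.eye(d)
            target = np.zeros(d)
            for traj in data:
                x, a = traj[h]
                f = phi[x, a]
                Lam += np.outer(f, f)
                if h + 1 < H:
                    target += f * V[traj[h + 1][0]]
            w = np.linalg.solve(Lam, target)
            Lam_inv = np.linalg.inv(Lam)
            bonus = beta * np.sqrt(np.einsum('xad,de,xae->xa', phi, Lam_inv, phi))
            Q[h] = np.clip(reward[h] + phi @ w + bonus, 0.0, H - h)
            V = (pi[h] * Q[h]).sum(axis=-1)   # V_h(x) = <Q_h(x, .), pi_h(. | x)>

    return pi
```

The accumulated `log_pi` implements the closed-form mirror-descent update in multiplicative-weights form, and clipping `Q` reflects that value functions are bounded by the remaining horizon; both are design choices of this sketch rather than claims about the paper's exact constants.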