Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
Authors: Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon Du, Jiantao Jiao
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that our algorithm achieves $\widetilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal when ignoring logarithmic factors. To our best knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL. (See the worked note after this table.) |
| Researcher Affiliation | Academia | (1) University of California, Berkeley; (2) Center for Data Science, Peking University; (3) Peng Cheng Laboratory; (4) Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; (5) University of Washington. |
| Pseudocode | Yes | Algorithm 1 Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT) |
| Open Source Code | No | The paper does not provide any information or links regarding open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not mention specific datasets or their public availability for training. |
| Dataset Splits | No | The paper is theoretical and does not describe dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention any software dependencies with specific version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |
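
The following is a hedged, worked unpacking of the regret claim quoted in the Research Type row: why the condition $S > H$ yields minimax optimality up to logarithmic factors. Treating $\Omega(\sqrt{SAH^3K})$ as the relevant lower bound (the standard one for time-inhomogeneous tabular MDPs) is an assumption here, not a statement extracted from the table.

```latex
% Hedged sketch, not a quotation from the paper: arithmetic behind the
% "minimax optimal when S > H" claim, using the regret bound quoted in the
% Research Type row. The \Omega(\sqrt{S A H^3 K}) benchmark is assumed.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}

The quoted upper bound on the regret over $K$ episodes is
\[
  \mathrm{Regret}(K) \;\le\; \widetilde{O}\!\left(\sqrt{S A H^{3} K} + \sqrt{A H^{4} K}\right).
\]
If $S > H$, the second term is dominated by the first, since
\[
  A H^{4} K \;=\; A \cdot H \cdot H^{3} K \;<\; S A H^{3} K
  \quad\Longrightarrow\quad
  \sqrt{A H^{4} K} \;<\; \sqrt{S A H^{3} K},
\]
so the bound collapses to $\widetilde{O}(\sqrt{S A H^{3} K})$, matching the
assumed $\Omega(\sqrt{S A H^{3} K})$ minimax lower bound up to logarithmic
factors.

\end{document}
```

This is only an arithmetic unpacking of the quoted sentence; it adds no result beyond what the paper's abstract claims.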