Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Authors: Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon Du, Jiantao Jiao

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We prove that our algorithm achieves $\widetilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When S > H, our algorithm is minimax optimal when ignoring logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
Researcher Affiliation | Academia | University of California, Berkeley; Center for Data Science, Peking University; Peng Cheng Laboratory; Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; University of Washington.
Pseudocode | Yes | Algorithm 1: Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT)
Open Source Code | No | The paper does not provide any information or links regarding open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not mention specific datasets or their public availability for training.
Dataset Splits | No | The paper is theoretical and does not describe dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention any software dependencies with specific version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations.
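For context on the "Research Type" row, here is a minimal LaTeX restatement of the quoted regret bound, under the standard tabular episodic-MDP notation (S states, A actions, horizon H, K episodes); the notation is an assumption on our part, since the table entry only quotes the bound itself:

\[
  \mathrm{Regret}(K) \;=\; \widetilde{O}\!\left(\sqrt{S A H^{3} K} \;+\; \sqrt{A H^{4} K}\right),
\]

where $\widetilde{O}(\cdot)$ hides logarithmic factors. When S > H, the first term dominates and matches the known $\Omega(\sqrt{S A H^{3} K})$ minimax lower bound for this setting up to logarithmic factors, which is the regime in which the paper claims near-minimax optimality.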