reproducibilityindex.ai

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Authors: Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon Du, Jiantao Jiao

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When S > H, our algorithm is minimax optimal when ignoring logarithmic factors. To our best knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
Researcher Affiliation	Academia	(1) University of California, Berkeley; (2) Center for Data Science, Peking University; (3) Peng Cheng Laboratory; (4) Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; (5) University of Washington.
Pseudocode	Yes	Algorithm 1 Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT)
Open Source Code	No	The paper does not provide any information or links regarding open-source code for the described methodology.
Open Datasets	No	The paper is theoretical and does not mention specific datasets or their public availability for training.
Dataset Splits	No	The paper is theoretical and does not describe dataset splits for training, validation, or testing.
Hardware Specification	No	The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies	No	The paper is theoretical and does not mention any software dependencies with specific version numbers.
Experiment Setup	No	The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations.