Near Optimal Reward-Free Reinforcement Learning
Authors: Zihan Zhang, Simon Du, Xiangyang Ji
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right)\right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with $H$, in contrast to all existing results, which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to plan for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning (see the sketch after this table); 3) constructing an extended MDP to maximize the truncated accumulative rewards efficiently. |
| Researcher Affiliation | Academia | 1Tsinghua University 2University of Washington. Correspondence to: Zihan Zhang <zihan-zh17@mails.tsinghua.edu.cn>, Simon S. Du <ssdu@cs.washington.edu>, Xiangyang Ji <xyji@tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 MAIN ALGORITHM: STAGED SAMPLING + TRUNCATED PLANNING; Algorithm 2 STAGED SAMPLING; Algorithm 3 Truncated Planning; Algorithm 4 and the proof of Lemma 3 are postponed to Appendix B due to space limitations. |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and analyzes algorithms. It does not mention using any specific publicly available datasets for experimental training. |
| Dataset Splits | No | The paper is theoretical and does not perform empirical validation on datasets, thus no dataset splits for training/validation/testing are provided. |
| Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers for running experiments. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and proofs; it does not include details on experimental setup, such as hyperparameters or training configurations. |
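
The abstract quoted above credits soft-truncated planning for the sample complexity that scales only logarithmically with $H$. The snippet below is a minimal, hypothetical sketch of the underlying idea on a toy tabular MDP: Bellman backups are capped at the normalized total-reward bound so values never exceed it. This is not the paper's SSTP planner, which uses a soft truncation operator together with empirical transition estimates; the function name, the `cap` parameter, and the toy MDP are illustrative assumptions.

```python
import numpy as np

def truncated_value_iteration(P, r, H, cap=1.0):
    """Toy finite-horizon value iteration with truncation (a sketch,
    not the paper's SSTP planner).

    P   : (S, A, S) transition probabilities
    r   : (S, A) per-step rewards, with the total reward over the
          horizon normalized to lie in [0, 1]
    H   : planning horizon
    cap : truncation level; here a hard cap standing in for the
          paper's soft-truncation operator (an assumption)
    """
    S, A, _ = P.shape
    V = np.zeros(S)                       # V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                     # Bellman backup: Q_h(s, a)
        Q = np.minimum(Q, cap)            # truncate at the reward cap
        policy[h] = Q.argmax(axis=1)      # greedy policy at step h
        V = Q.max(axis=1)                 # V_h(s) = max_a Q_h(s, a)
    return V, policy

# Example: random 3-state, 2-action MDP; per-step rewards are scaled by
# 1/H so the total return respects the [0, 1] normalization assumed above.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 10
P = rng.dirichlet(np.ones(S), size=(S, A))   # valid transition kernel
r = rng.uniform(0, 1.0 / H, size=(S, A))
V0, pi = truncated_value_iteration(P, r, H)
```

Because every backup is clipped at the normalized reward bound, no single state's value can accumulate past it; this is the intuition behind why the paper's (soft) truncation removes the polynomial dependence on $H$ that plain value iteration analyses incur.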