Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provable Zero-Shot Generalization in Offline Reinforcement Learning

Authors: Zhiyong Wang, Chen Yang, John C.S. Lui, Dongruo Zhou

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | "We propose two meta-algorithms called pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO) that enable ZSG for offline RL (Jin et al., 2021). ... Our result shows that the sub-optimalities of the output policies are bounded by both the supervised learning error, which is controlled by the number of different environments, and the reinforcement learning error, which is controlled by the coverage of the offline dataset to the optimal policy. Please refer to Table 1 for a summary of our results. To the best of our knowledge, our proposed algorithms are the first offline RL methods that provably enjoy the ZSG property. ... Next we propose a theoretical analysis of PERM. ... Theorem 9 Set the Evaluation subroutine in Algorithm 2 as PPE (Algorithm 1). ... the output π_PERM of Algorithm 2 satisfies SubOpt(π_PERM) ..." |
| Researcher Affiliation | Academia | "1 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; 2 Department of Computer Science, Indiana University Bloomington, Bloomington, USA." |
| Pseudocode | Yes | "Algorithm 1 Pessimistic Policy Evaluation (PPE) ... Algorithm 2 Pessimistic Empirical Risk Minimization (PERM) ... Algorithm 3 Pessimistic Proximal Policy Optimization (PPPO) ... Algorithm 4 Pessimistic Policy Evaluation (PPE): Linear MDP" |
| Open Source Code | No | The paper does not contain any statements or links regarding the release of source code for the methodology described. |
| Open Datasets | No | The paper discusses theoretical properties of offline datasets and a data collection process, but it does not specify or provide access information for any concrete publicly available dataset used for experimental validation. |
| Dataset Splits | No | The paper is theoretical and does not describe any experimental setup involving dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers required for replication. |
| Experiment Setup | No | The paper is theoretical and does not describe any specific experimental setup details or hyperparameters. |
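The pessimism principle behind the paper's PERM and PPPO meta-algorithms — penalizing value estimates by their uncertainty so that poorly covered actions are avoided — can be illustrated with a toy one-step example. The sketch below is NOT the paper's algorithms: `pessimistic_value_estimate`, the `beta` penalty scale, and the count-based bonus are illustrative assumptions, showing only the generic lower-confidence-bound idea used in pessimistic offline RL.

```python
import math
from collections import defaultdict

def pessimistic_value_estimate(dataset, num_actions, beta=1.0):
    """Toy one-step pessimistic (lower-confidence-bound) policy extraction.

    dataset: list of (state, action, reward) tuples logged by a behavior policy.
    For each state, pick the action maximizing mean reward minus an
    uncertainty penalty beta / sqrt(n), where n is the visit count.
    Actions never observed in the data get maximal pessimism (-inf).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, a, r in dataset:
        sums[(s, a)] += r
        counts[(s, a)] += 1

    policy = {}
    for s in {s for s, _, _ in dataset}:
        best_action, best_lcb = None, -math.inf
        for a in range(num_actions):
            n = counts[(s, a)]
            if n == 0:
                lcb = -math.inf  # uncovered action: rule it out entirely
            else:
                mean = sums[(s, a)] / n
                lcb = mean - beta / math.sqrt(n)  # subtract uncertainty bonus
            if lcb > best_lcb:
                best_action, best_lcb = a, lcb
        policy[s] = best_action
    return policy
```

With 100 samples of a mediocre action and a single lucky sample of another, the penalty makes the well-covered action win, mirroring the role that dataset coverage of the optimal policy plays in the paper's suboptimality bounds.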