Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL
Authors: Matthew Zurek, Guy Zamir, Yudong Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empirical-span-based penalty function. Our algorithm also does not require any prior parameter knowledge for its implementation. Remarkably, we show via hard examples that learning under our conditions requires coverage assumptions beyond the stationary distribution of the target policy, distinguishing single-policy complexity measures from previously examined cases. We also develop lower bounds nearly matching our main result. |
| Researcher Affiliation | Academia | Matthew Zurek Department of Computer Sciences University of Wisconsin Madison EMAIL Guy Zamir Department of Computer Sciences University of Wisconsin Madison EMAIL Yudong Chen Department of Computer Sciences University of Wisconsin Madison EMAIL |
| Pseudocode | Yes | Algorithm 1 Pessimistic Value Iteration With Quantile Clipping input: Dataset D, reward function r, discount factor γ (0, 1), failure probability δ (0, 1) 1: Form empirical transition matrix b P used in b Tpe from D 2: Let b Q0 = 0 and K = l log( 2ntot 1 γ ) 1 γ m initialization and number of iterations 3: for t = 1, . . . , K do 4: Let b Qt = b Tpe( b Qt 1) 5: end for 6: Let b Q = b QK and for each s S, let bπ(s) argmaxa A b Q(s, a) 7: return bπ, b Q |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper discusses learning from "historical data" and refers to "dataset D = (s, a, Si s,a)", but it does not specify any publicly available datasets, provide links, or formal citations to external datasets used in experiments. |
| Dataset Splits | No | The paper does not contain any experimental evaluation on specific datasets, therefore, no dataset split information is provided. |
| Hardware Specification | No | The paper does not conduct experiments, and therefore, no hardware specifications are provided. |
| Software Dependencies | No | The paper does not conduct experiments, and therefore, no specific software dependencies with version numbers are listed. |
| Experiment Setup | No | The paper focuses on theoretical contributions, including algorithm design, but does not present an experimental setup with specific hyperparameter values or training configurations. |