Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning
Authors: Zijun Chen, Shengbo Wang, Nian Si
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further validate the convergence rates of our algorithms through numerical experiments. ... In this section, we present numerical experiments to validate our theoretical results. We employ the Hard MDP family introduced in Wang et al. [37], which confirms a minimax sample complexity lower bound of Ω(tminorizeε 2) for estimating the average reward to within an ε absolute error in the non-robust setting, matching the known upper bound. Our experiments show an empirical convergence rate of n 1/2 for both algorithms, validating them as the first algorithms that achieve this rate in the DR-AMDP setting. |
| Researcher Affiliation | Academia | Zijun Chen Department of Computer Science and Engineering Hong Kong University of Science and Technology EMAIL Shengbo Wang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California EMAIL Nian Si Department of Industrial Engineering and Decision Analytics Hong Kong University of Science and Technology EMAIL |
| Pseudocode | Yes | Algorithm 1 Distributional Robust DMDP: DR DMDP(γ, n, D) Algorithm 2 Reduction to DMDP Algorithm 3 Anchored DR-AMDP |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have included the complete experimental code in the supplemental materials. |
| Open Datasets | Yes | We employ the Hard MDP family introduced in Wang et al. [37], which confirms a minimax sample complexity lower bound of Ω(tminorizeε 2) for estimating the average reward to within an ε absolute error in the non-robust setting, matching the known upper bound. |
| Dataset Splits | No | The sub-figures in Figure 2 presents the error achieved by the algorithms using a total of n transition samples for every state-action pair. Each data point in the plots corresponds to a single estimate generated by one independent run of the corresponding algorithm. Explanation: The paper describes using a generative model to produce 'n transition samples' and specific MDP instances, but it does not specify traditional train/test/validation splits for a pre-existing dataset. The context implies data is generated as needed rather than split from a fixed pool. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: Our sample complexity analysis provides statistical guarantees that are independent of computational power considerations, which represent a distinct aspect from our theoretical focus. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies (e.g., programming languages, libraries, frameworks, or solvers) along with their version numbers. |
| Experiment Setup | Yes | In this section, we present numerical experiments to validate our theoretical results. We employ the Hard MDP family introduced in Wang et al. [37]... Next, we evaluate the performance of Algorithm 2 and 3 by analyzing their value approximation errors under both KL and χ2 uncertainty sets. ... The sub-figures in Figure 2 presents the error achieved by the algorithms using a total of n transition samples for every state-action pair. ... The nominal transition distribution is specified by ps,a N(1, σs,a), where σs,a Uniform[0, 100], followed by normalization. We then choose the uncertainty size δ = 0.4 to introduce stronger perturbations, following the setting in Wang et al. [46] under the KL-divergence. |