Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning
Authors: Na Li, Zewu Zheng, Wei Ni, Hangguan Shan, Wenjie Zhang, Xinyu Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To effectively evaluate our algorithm, we have conducted numerical experiments on randomly generated transition kernels, following the code proposed by [39]. In particular, we adopt the parameter setting as S = 50, A = B = 2, and H = 100, averaged over 100 seeds. Experiments are conducted on PyTorch 2.0.0 with a single NVIDIA RTX 4090 24GB GPU. In our experiments, the robust NE at each state and timestep is computed using standard NE solvers, i.e., the Python package nashpy. Our algorithm is compatible with any exact or approximate NE solver, including computational relaxations or sampling-based methods. As shown in Figure 1(a), the case of K = 148 e5 demonstrates that our proposed algorithm consistently outperforms the baseline value iteration for robust TZMGs (RTZ-VI) across all states and all sample sizes. This trend remains consistent across other values of K as well. Moreover, we have plotted the sub-optimality performance gap of RTZ-VI-LCB w.r.t. the sample size on a log-log scale to corroborate the scaling of the sample size on the performance gap. Fitting using linear regression leads to a slope estimate of 0.4877. This nicely matches the finding of our theoretical guarantee. |
| Researcher Affiliation | Academia | Na Li Zhejiang University EMAIL Zewu Zheng The Chinese University of Hong Kong EMAIL Wei Ni Edith Cowan University and University of New South Wales EMAIL Hangguan Shan Zhejiang University EMAIL Wenjie Zhang University of New South Wales EMAIL Xinyu Li Huazhong University of Science and Technology EMAIL |
| Pseudocode | Yes | Algorithm 1 Two-stage subsampling for RTZ-VI-LCB. Algorithm 2 Value iteration with lower confidence bounds for RTZMGs (RTZ-VI-LCB). Algorithm 3 Two-stage subsampling for Multi-RTZ-VI-LCB. Algorithm 4 Multi-RTZ-VI-LCB. |
| Open Source Code | Yes | Code is available at https://github.com/NLee10/RTZ-VI-LCB. |
| Open Datasets | No | The paper conducts numerical experiments on "randomly generated transition kernels" but does not provide specific access information (link, citation, repository) for these generated datasets or any other public dataset. |
| Dataset Splits | No | The paper describes experiments on 'randomly generated transition kernels' and mentions averaging over '100 seeds', which are experimental runs for generating environment data. It does not discuss typical training/test/validation splits of a static dataset, as the research is in reinforcement learning with environment interactions. |
| Hardware Specification | Yes | Experiments are conducted on PyTorch 2.0.0 with a single NVIDIA RTX 4090 24GB GPU. |
| Software Dependencies | Yes | Experiments are conducted on PyTorch 2.0.0 with a single NVIDIA RTX 4090 24GB GPU. In our experiments, the robust NE at each state and timestep is computed using standard NE solvers, i.e., the Python package nashpy. |
| Experiment Setup | Yes | In particular, we adopt the parameter setting as S = 50, A = B = 2, and H = 100, averaged over 100 seeds. |
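
The setup quoted in the table (randomly generated transition kernels with S = 50, A = B = 2, H = 100, and a linear fit on a log-log scale) can be illustrated with a minimal sketch. This is not the authors' code: the paper's generator follows the code of [39], which is not reproduced here, so drawing uniform weights and normalizing them is an assumption. The function names `random_transition_kernel` and `loglog_slope` are hypothetical, and the data in the usage example are synthetic, not the paper's results.

```python
import math
import random


def random_transition_kernel(num_states, num_a, num_b, horizon, seed):
    """Build a kernel P[h][s][a][b] -> list of next-state probabilities.

    Assumption: each distribution is drawn by normalizing uniform random
    weights; the paper's actual sampling scheme may differ.
    """
    rng = random.Random(seed)  # one generator per seed for repeatability
    kernel = []
    for _ in range(horizon):
        step = []
        for _ in range(num_states):
            per_state = []
            for _ in range(num_a):
                per_a = []
                for _ in range(num_b):
                    weights = [rng.random() for _ in range(num_states)]
                    total = sum(weights)
                    per_a.append([w / total for w in weights])
                per_state.append(per_a)
            step.append(per_state)
        kernel.append(step)
    return kernel


def loglog_slope(sample_sizes, gaps):
    """Ordinary least-squares slope of log(gap) against log(sample size)."""
    xs = [math.log(k) for k in sample_sizes]
    ys = [math.log(g) for g in gaps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Small sizes keep the demo fast; the table reports S=50, A=B=2, H=100.
    kernel = random_transition_kernel(5, 2, 2, 3, seed=0)
    print(len(kernel), len(kernel[0]), len(kernel[0][0][0][0]))

    # Synthetic gaps that decay exactly as K^(-1/2), for illustration only.
    sizes = [1e3, 1e4, 1e5, 1e6]
    gaps = [3.0 / math.sqrt(k) for k in sizes]
    print(loglog_slope(sizes, gaps))
```

With gaps that decay as K^(-1/2), the fitted slope in this convention is close to -0.5, similar in magnitude to the 0.4877 estimate quoted above. Averaging over seeds, as in the quoted setup, would amount to calling the generator once per seed and averaging the resulting gaps before the fit.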