Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes

Authors: Xuyuan Xiong, Pedro Chumpitaz-Flores, Kaixun Hua, Cheng Hua

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on standard benchmarks demonstrate that SPOT achieves substantial speedup and scales to larger MDPs with a significantly higher number of states. The resulting decision tree policies are interpretable and compact, maintaining transparency without compromising performance. These results demonstrate that our approach simultaneously achieves interpretability and scalability, delivering high-quality policies an order of magnitude faster than existing approaches.
Researcher Affiliation Academia 1Shanghai Jiao Tong University 2University of South Florida
Pseudocode Yes The complete Algorithm is presented in Algorithm 1. The overall structure and procedure of the RSBB algorithm are summarized in Algorithm 2 (see Supplementary Material).
Open Source Code Yes Answer: [Yes]. Justification: We provide the whole code package including the testing data file in the supplementary material for reproduction.
Open Datasets Yes The first four MDPs are available at https://github.com/tudelft-cda-lab/OMDT, while the remaining five can be found at https://github.com/prismmodelchecker/prism-benchmarks/.
Dataset Splits No MDP Preparation We prepare a set of 9 MDP datasets to evaluate our methods: sys_ad_1, sys_ad_2, tic_vs_ran, tiger_vs_ant, csma_2_2, csma_2_4, firewire, wlan0, and wlan1. The goal is to find a policy π that maximizes the expected cumulative discounted reward Eπ P t=0 γt Ritit+1kt .
Hardware Specification No We then perform a single iteration of optimization to find a tree with depth D = 3 under three different approaches: (i) solving directly with Gurobi, (ii) using RSBB (reduced space branch-and-bound) in serial, and (iii) using RSBB in parallel across 10 CPU cores.
Software Dependencies No We then perform a single iteration of optimization to find a tree with depth D = 3 under three different approaches: (i) solving directly with Gurobi, (ii) using RSBB (reduced space branch-and-bound) in serial, and (iii) using RSBB in parallel across 10 CPU cores.
Experiment Setup Yes During SPOT training, we adopt a time-constrained setup with a total runtime budget of one hour (60 minutes) and a maximum of five minutes allocated per iteration. At each step, we retain the best solution found within the allotted time window. We fix the threshold ϕl at 0.5 and, for simplicity, set all coefficients Phiπold i to 1. To enhance exploration, we employ an epsilon-decay strategy: when computing V old, the evaluation is based on a greedy policy with an exploration probability ϵ = 1/l, where l denotes the current iteration index. With probability ϵ, the policy randomly selects alternative actions rather than the greedy one. This mechanism encourages the policy to explore beyond the most immediate choices. In each iteration, we use Algorithm 2 to compute the solution. The solver is configured with a gap tolerance of 0.01 and a maximum runtime of 14,800 seconds.