Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On Evaluating Policies for Robust POMDPs

Authors: Merlijn Krale, Eline M. Bovy, Maris F. L. Galesloot, Thiago Simão, Nils Jansen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental evaluation shows that (1) our proposed benchmarks cannot be solved by assuming naive nature policies, (2) our method of evaluating policies is accurate, and (3) the upper bounds provide solid baselines for evaluation.
Researcher Affiliation	Academia	Merlijn Krale (Radboud University, Nijmegen, The Netherlands); Eline M. Bovy (Radboud University, Nijmegen, The Netherlands); Maris F. L. Galesloot (Radboud University, Nijmegen, The Netherlands); Thiago D. Simão (Eindhoven University of Technology, Eindhoven, The Netherlands); Nils Jansen (Ruhr-University Bochum, Bochum, Germany, and Radboud University, Nijmegen, The Netherlands)
Pseudocode	No	The paper describes algorithms and modifications to existing methods (e.g., RHSVI) but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	We implement all methods in a Julia framework (based on POMDPs.jl [12]) to facilitate future research; available on Zenodo [29].
Open Datasets	Yes	Secondly, we lift several POMDPs from the literature into RPOMDPs: TIGER [6], MINIHALLWAY [33], and ALOHA [24], as well as an expanded variant of HEAVENORHELL [4] (also used in [45]).
Dataset Splits	No	The paper uses simulated environments and benchmarks rather than static datasets with explicit training, validation, or test splits. The concept of dataset splits is not applicable in this context.
Hardware Specification	Yes	All experiments were conducted in Julia (version 1.11.5) on the same Ubuntu machine (version 22.04.5 LTS), which has an Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz and 256GB RAM (8 x 32GB DDR4-3200).
Software Dependencies	Yes	We implement our evaluation method in the Julia programming language, using a variant of the POMDPs.jl framework [12] for RPOMDPs with interval uncertainty sets... All experiments were conducted in Julia (version 1.11.5)...
Experiment Setup	Yes	We use a discount factor of γ = 1 for TOY, of γ = 0.99 for ECHO and HEAVENORHELL, and of γ = 0.95 for all other environments. ... For evaluation, we run MCTS five times and report the lowest value...
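
The Experiment Setup row can be read as a small configuration plus a reporting rule. The sketch below is illustrative only and is not the authors' code (their implementation is in Julia): the dictionary layout, the function names, and the `estimate_value` callback standing in for a single MCTS run are all hypothetical. Only the discount factors and the "five runs, report the lowest" rule come from the quoted text.

```python
# Discount factors quoted in the Experiment Setup row.
# The dictionary layout and names are assumptions made for illustration.
DISCOUNT = {"TOY": 1.0, "ECHO": 0.99, "HEAVENORHELL": 0.99}
DEFAULT_DISCOUNT = 0.95  # "all other environments"


def discount_for(env_name):
    """Return the discount factor for an environment name (case-insensitive)."""
    return DISCOUNT.get(env_name.upper(), DEFAULT_DISCOUNT)


def lowest_of_runs(estimate_value, runs=5):
    """Call estimate_value(run_index) `runs` times and return the lowest result.

    `estimate_value` is a hypothetical stand-in for one MCTS evaluation run.
    """
    return min(estimate_value(i) for i in range(runs))
```

For example, `discount_for("Tiger")` falls through to the default because TIGER is not among the three environments named explicitly.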