Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Planning and Learning in Average Risk-aware MDPs

Authors: Weikai Wang, Erick Delage

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve. Section 5 presents the numerical experiments.
Researcher Affiliation	Academia	1 GERAD & HEC Montréal 2 Mila Québec AI Institute EMAIL
Pseudocode	Yes	Pseudo-codes, proofs, and additional experiment details and results are provided in the appendix.
Open Source Code	Yes	We provide the code for implementation in the anonymous repository: https://anonymous.4open.science/r/P-L_ARMDP-NeurIPS2025-3471.
Open Datasets	No	We begin by validating the convergence of the risk-aware RVI Q-learning algorithm (4.1) using a randomly generated MDP with 10 states and 5 actions per state. The nominal transition kernel P is generated from a uniform distribution over [0, 1] and subsequently normalized. The cost function is sampled from a normal distribution N(1, 1).
Dataset Splits	No	The paper describes generating synthetic data and environments for its experiments (e.g., 'randomly generated MDP', 'degradation probabilities are generated randomly', 'probability of the incoming water level is randomly generated', 'probability of the incoming demand is generated randomly') rather than using pre-existing datasets with defined splits.
Hardware Specification	Yes	All the experiments were carried out using Python 3.9 on a Linux server equipped with a 64-core AMD EPYC 7763 processor.
Software Dependencies	Yes	All the experiments were carried out using Python 3.9 on a Linux server equipped with a 64-core AMD EPYC 7763 processor.
Experiment Setup	Yes	We run the MLMC Q-learning algorithm 100 times independently with r = 0.49 and plot the mean value of f(Q_n) in Figure 5.1, with the 95th and 5th percentiles as the confidence interval (CI). For the parameters, we define a scenario with 30 degradation states, where state 0 represents a fully new machine and state 29 corresponds to a failure. The replacement cost is set to 301.5, the operating cost is 1·s, and the maintenance cost is 0.5·s^1.5, where s denotes the current state level. Additionally, the failure cost is twice the replacement cost, ensuring significant penalties for machine failure.
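
The synthetic setup quoted in the Open Datasets and Experiment Setup rows can be illustrated with a short sketch. It is not taken from the authors' repository. The function names (random_mdp, percentile_band), the seeds, the row-wise normalization of the kernel, and the placeholder curves are all assumptions made for illustration. Only the sizes (10 states, 5 actions), the sampling distributions, the 100 runs, and the 5th/95th percentile band come from the quoted text.

```python
import numpy as np


def random_mdp(n_states=10, n_actions=5, seed=0):
    """Sample a nominal kernel P and a cost table c (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Uniform[0, 1] entries, then normalize so that each (state, action)
    # row is a probability distribution over next states (assumed reading
    # of "subsequently normalized").
    P = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)
    # Costs drawn from N(1, 1): mean 1, variance 1 (standard deviation 1).
    c = rng.normal(loc=1.0, scale=1.0, size=(n_states, n_actions))
    return P, c


def percentile_band(runs, lower=5, upper=95):
    """Mean curve and percentile band across independent runs (axis 0)."""
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean(axis=0)
    lo, hi = np.percentile(runs, [lower, upper], axis=0)
    return mean, lo, hi


if __name__ == "__main__":
    P, c = random_mdp()
    print(P.shape, c.shape)

    # Placeholder curves standing in for f(Q_n) over 100 independent runs;
    # these are random walks, not results from the paper.
    fake_runs = np.random.default_rng(1).normal(size=(100, 50)).cumsum(axis=1)
    mean, lo, hi = percentile_band(fake_runs)
    print(mean.shape, lo.shape, hi.shape)
```

Because the environment is regenerated from a seed rather than loaded from a file, there is no fixed dataset or train/test split to report, which is consistent with the "No" results in the Open Datasets and Dataset Splits rows.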