Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Planning and Learning in Average Risk-aware MDPs
Authors: Weikai Wang, Erick Delage
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve. Section 5 presents the numerical experiments. |
| Researcher Affiliation | Academia | 1 GERAD & HEC Montréal 2 Mila Québec AI Institute EMAIL |
| Pseudocode | Yes | Pseudo-codes, proofs, and additional experiment details and results are provided in the appendix. |
| Open Source Code | Yes | We provide the code for implementation in the anonymous repository: https://anonymous.4open.science/r/P-L_ARMDP-NeurIPS2025-3471. |
| Open Datasets | No | We begin by validating the convergence of the risk-aware RVI Q-learning algorithm (4.1) using a randomly generated MDP with 10 states and 5 actions per state. The nominal transition kernel P is generated from a uniform distribution over [0, 1] and subsequently normalized. The cost function is sampled from a normal distribution N(1, 1). |
| Dataset Splits | No | The paper describes generating synthetic data and environments for its experiments (e.g., 'randomly generated MDP', 'degradation probabilities are generated randomly', 'probability of the incoming water level is randomly generated', 'probability of the incoming demand is generated randomly') rather than using pre-existing datasets with defined splits. |
| Hardware Specification | Yes | All the experiments were carried out using Python 3.9 on a Linux server equipped with a 64-core AMD EPYC 7763 processor. |
| Software Dependencies | Yes | All the experiments were carried out using Python 3.9 on a Linux server equipped with a 64-core AMD EPYC 7763 processor. |
| Experiment Setup | Yes | We run the MLMC Q-learning algorithm 100 times independently with r = 0.49 and plot the mean value of f(Q_n) in Figure 5.1, with the 95th and 5th percentiles as the confidence interval (CI). For the parameters, we define a scenario with 30 degradation states, where state 0 represents a fully new machine and state 29 corresponds to a failure. The replacement cost is set to 301.5, the operating cost is 1·s, and the maintenance cost is 0.5·s^1.5, where s denotes the current state level. Additionally, the failure cost is twice the replacement cost, ensuring significant penalties for machine failure. |
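
The synthetic setup quoted in the Open Datasets and Experiment Setup rows can be approximated as follows. This is a minimal sketch, not the authors' code: the function names `random_mdp` and `percentile_band`, the seed handling, and the choice to draw one cost per state-action pair are assumptions made for illustration. The quoted text only specifies 10 states, 5 actions per state, a uniform [0, 1] kernel that is then normalized, costs drawn from N(1, 1), and a 5th to 95th percentile band over independent runs.

```python
import numpy as np


def random_mdp(n_states=10, n_actions=5, seed=0):
    """Draw a random MDP in the spirit of the quoted setup (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Nominal kernel: uniform [0, 1] entries, normalized so each
    # (state, action) row is a probability distribution over next states.
    P = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    # Costs from N(1, 1); one cost per (state, action) pair is an assumption.
    cost = rng.normal(loc=1.0, scale=1.0, size=(n_states, n_actions))
    return P, cost


def percentile_band(runs, lower=5, upper=95):
    """Mean and percentile band across independent runs (rows = runs)."""
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean(axis=0)
    lo = np.percentile(runs, lower, axis=0)
    hi = np.percentile(runs, upper, axis=0)
    return mean, lo, hi
```

For example, `P, cost = random_mdp()` yields a kernel of shape `(10, 5, 10)` whose last axis sums to one, and `percentile_band` applied to a `(100, n_iterations)` array of per-run values gives the mean curve and the band described for Figure 5.1.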