Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Risk-Averse Total-Reward Reinforcement Learning

Authors: Xihong Su, Jia Lin Hau, Gersi Doko, Kishan Panaganti, Marek Petrik

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function. ... 5 Numerical Evaluation: In this section, we evaluate our algorithms on two tabular domains: cliff walking (CW) [4] and gambler's ruin (GR) [44]. |
| Researcher Affiliation | Academia | Xihong Su, Department of Computer Science, University of New Hampshire, Durham, NH 03824, EMAIL; Jia Lin Hau, Department of Computer Science, University of New Hampshire, Durham, NH 03824, EMAIL; Gersi Doko, Department of Computer Science, University of New Hampshire, Durham, NH 03824, EMAIL; Kishan Panaganti, Department of Computing & Mathematical Sciences, California Institute of Technology (now at Tencent AI Lab, Seattle, WA), EMAIL; Marek Petrik, Department of Computer Science, University of New Hampshire, Durham, NH 03824, EMAIL |
| Pseudocode | Yes | Algorithm 1: ERM-TRC Q-learning algorithm ... Algorithm 2: EVaR-TRC Q-learning algorithm ... Algorithm 3: A heuristic algorithm for computing z bounds |
| Open Source Code | Yes | The source code is available at https://github.com/suxh2019/ERM_EVaR_Q. |
| Open Datasets | Yes | In this section, we evaluate our algorithms on two tabular domains: cliff walking (CW) [4] and gambler's ruin (GR) [44]. |
| Dataset Splits | No | In this problem, an agent starts with a random state (cell in the grid world) that is uniformly distributed over all non-sink states and walks toward the goal state labeled by g shown in Figure 1. ... we simulate the two optimal EVaR policies over 48,000 episodes and display the distribution of returns in Figure 3. ... we use six random seeds to generate samples, compute the optimal policies, and calculate the EVaR values on the CW domain. |
| Hardware Specification | Yes | The machine used to conduct all experiments referenced in Section 5 is a single machine with the following specifications: AMD Ryzen Threadripper 3970X 32-Core (64) @ 4.55 GHz; 256 GB RAM; Julia 1.11.5 |
| Software Dependencies | No | The machine used to conduct all experiments referenced in Section 5 is a single machine with the following specifications: AMD Ryzen Threadripper 3970X 32-Core (64) @ 4.55 GHz; 256 GB RAM; Julia 1.11.5 |
| Experiment Setup | No | Figure 1 and Figure 2 show the optimal policies for EVaR with risk level α = 0.2 and α = 0.6 separately. ... For each episode, the agent takes 20,000 steps and collects all rewards during the path. ... we simulate the two optimal EVaR policies over 48,000 episodes ... The step sizes ηi to converge to the optimal state-action value function. |