Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Risk-Averse Total-Reward Reinforcement Learning
Authors: Xihong Su, Jia Lin Hau, Gersi Doko, Kishan Panaganti, Marek Petrik
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function. ... 5 Numerical Evaluation In this section, we evaluate our algorithms on two tabular domains: cliff walking (CW) [4] and gambler's ruin (GR) [44]. |
| Researcher Affiliation | Academia | Xihong Su Department of Computer Science University of New Hampshire Durham, NH 03824 EMAIL Jia Lin Hau Department of Computer Science University of New Hampshire Durham, NH 03824 EMAIL Gersi Doko Department of Computer Science University of New Hampshire Durham, NH 03824 EMAIL Kishan Panaganti Department of Computing & Mathematical Sciences California Institute of Technology (now at Tencent AI Lab, Seattle, WA) EMAIL Marek Petrik Department of Computer Science University of New Hampshire Durham, NH 03824 EMAIL |
| Pseudocode | Yes | Algorithm 1: ERM-TRC Q-learning algorithm ... Algorithm 2: EVaR-TRC Q-learning algorithm ... Algorithm 3: A heuristic algorithm for computing z bounds |
| Open Source Code | Yes | The source code is available at https://github.com/suxh2019/ERM_EVaR_Q. |
| Open Datasets | Yes | In this section, we evaluate our algorithms on two tabular domains: cliff walking (CW) [4] and gambler's ruin (GR) [44]. |
| Dataset Splits | No | In this problem, an agent starts with a random state (cell in the grid world) that is uniformly distributed over all non-sink states and walks toward the goal state labeled by g shown in Figure 1. ... we simulate the two optimal EVaR policies over 48,000 episodes and display the distribution of returns in Figure 3. ... we use six random seeds to generate samples, compute the optimal policies, and calculate the EVaR values on the CW domain. |
| Hardware Specification | Yes | The machine used to conduct all experiments referenced in Section 5 is a single machine with the following specifications: AMD Ryzen Threadripper 3970X 32-Core (64) @ 4.55 GHz 256 GB RAM Julia 1.11.5 |
| Software Dependencies | No | The machine used to conduct all experiments referenced in Section 5 is a single machine with the following specifications: AMD Ryzen Threadripper 3970X 32-Core (64) @ 4.55 GHz 256 GB RAM Julia 1.11.5 |
| Experiment Setup | No | Figure 1 and Figure 2 show the optimal policies for EVaR with risk level α = 0.2 and α = 0.6 separately. ... For each episode, the agent takes 20,000 steps and collects all rewards during the path. ... we simulate the two optimal EVaR policies over 48,000 episodes ... The step sizes ηi to converge to the optimal state-action value function. |