Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Risk-Averse Reinforcement Learning
Authors: Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails. Our results and CeSoR implementation are available on Github. We conduct experiments in 3 different domains. |
| Researcher Affiliation | Collaboration | Ido Greenberg (Technion); Yinlam Chow (Google Research); Mohammad Ghavamzadeh (Google Research); Shie Mannor (Technion, Nvidia Research) |
| Pseudocode | Yes | Algorithm 1: CeSoR |
| Open Source Code | Yes | Our results and CeSoR implementation are available on Github. The stand-alone cross entropy module is available on PyPI. |
| Open Datasets | No | We conduct experiments in 3 different domains. We implement CeSoR on top of a standard CVaR-PG method, which is also used as a risk-averse baseline for comparison. Specifically, we use the standard GCVaR [Tamar et al., 2015b], which guarantees convenient convergence properties (see Appendix C) and is simple to implement and analyze. We also use the standard policy gradient (PG) as a risk-neutral baseline. |
| Dataset Splits | No | Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). |
| Hardware Specification | Yes | In each of the 3 domains, the experiments required a running time of a few hours on an Ubuntu machine with eight i9-10900X CPU cores. |
| Software Dependencies | No | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014]. |
| Experiment Setup | Yes | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014], with a learning rate selected manually per benchmark and N = 400 episodes per training step. Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). For CeSoR, unless specified otherwise, ν = 20% of the trajectories per batch are drawn from the original distribution D_{ϕ_0}; β = 20% are used for the CE update; and the soft risk level reaches α after ρ = 80% of the training. As mentioned in Section 4, for numerical stability, we also clip the IS weights (Algorithm 1, Line 9) to the range [1/5, 5]. |
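
The Experiment Setup row mentions three mechanisms that are easy to illustrate in isolation: a soft risk level that reaches α after ρ = 80% of training, clipping of the IS weights to [1/5, 5], and a CVaR-based validation score. The Python sketch below is illustrative only and is not taken from the authors' code. The function names (`soft_risk_level`, `clip_is_weights`, `empirical_cvar`) are hypothetical, the linear annealing from 1 down to α is an assumption (the excerpt states only when α is reached, not the shape of the schedule), and the CVaR estimator is the common "mean of the worst α-fraction of returns" form.

```python
import math


def soft_risk_level(step, total_steps, alpha, rho=0.8):
    """Anneal the risk level from 1 (risk-neutral) down to alpha.

    Assumption: the decrease is linear and finishes after a fraction
    rho of training; the excerpt only states that alpha is reached then.
    """
    progress = min(1.0, step / (rho * total_steps))
    return 1.0 - (1.0 - alpha) * progress


def clip_is_weights(weights, low=1 / 5, high=5.0):
    """Clip importance-sampling weights to [low, high] for numerical stability."""
    return [min(max(w, low), high) for w in weights]


def empirical_cvar(returns, alpha):
    """Estimate CVaR as the mean of the worst ceil(alpha * N) returns."""
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


if __name__ == "__main__":
    # Hypothetical run of 100 training steps with target risk level 5%.
    for step in (0, 40, 80, 100):
        print(step, soft_risk_level(step, total_steps=100, alpha=0.05))
    print(clip_is_weights([0.01, 1.0, 100.0]))
    print(empirical_cvar([8, 1, 5, 2, 7, 3, 6, 4], alpha=0.25))
```

Under these assumptions, early training steps weight nearly all episodes (risk level close to 1), and only the last 20% of training optimizes the target α-tail directly.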