Efficient Risk-Averse Reinforcement Learning
Authors: Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails. Our results and CeSoR implementation are available on GitHub. We conduct experiments in 3 different domains. |
| Researcher Affiliation | Collaboration | Ido Greenberg (Technion, gido@campus.technion.ac.il); Yinlam Chow (Google Research, yinlamchow@google.com); Mohammad Ghavamzadeh (Google Research, ghavamza@google.com); Shie Mannor (Technion and Nvidia Research, shie@ee.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1: CeSoR |
| Open Source Code | Yes | Our results and CeSoR implementation are available on GitHub. The stand-alone cross-entropy module is available on PyPI. |
| Open Datasets | No | We conduct experiments in 3 different domains. We implement CeSoR on top of a standard CVaR-PG method, which is also used as a risk-averse baseline for comparison. Specifically, we use the standard GCVaR [Tamar et al., 2015b], which guarantees convenient convergence properties (see Appendix C) and is simple to implement and analyze. We also use the standard policy gradient (PG) as a risk-neutral baseline. |
| Dataset Splits | No | Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). |
| Hardware Specification | Yes | In each of the 3 domains, the experiments required a running time of a few hours on an Ubuntu machine with eight i9-10900X CPU cores. |
| Software Dependencies | No | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014]. |
| Experiment Setup | Yes | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014], with a learning rate selected manually per benchmark and N = 400 episodes per training step. Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). For CeSoR, unless specified otherwise, ν = 20% of the trajectories per batch are drawn from the original distribution Dϕ₀; β = 20% are used for the CE update; and the soft risk level reaches α after ρ = 80% of the training. As mentioned in Section 4, for numerical stability, we also clip the IS weights (Algorithm 1, Line 9) to the range [1/5, 5]. |
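
The Dataset Splits and Experiment Setup rows quote the paper's model-selection protocol: validation episodes every 10 training steps, with the final policy chosen by best mean return for PG and best CVaR for GCVaR and CeSoR. The snippet below is a minimal sketch of that selection rule, assuming a lower-tail CVaR over validation-episode returns; the function names and the placeholder risk level `alpha` are illustrative, not taken from the released code.

```python
import numpy as np

def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns (lower-tail CVaR)."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()

def select_final_policy(validation_returns, alpha, risk_averse=True):
    """validation_returns: dict mapping checkpoint -> array of validation-episode returns.
    Risk-neutral PG keeps the checkpoint with the best mean; GCVaR/CeSoR keep the best CVaR."""
    score = (lambda r: cvar(r, alpha)) if risk_averse else np.mean
    return max(validation_returns, key=lambda ckpt: score(validation_returns[ckpt]))
```

For example, `select_final_policy({10: r10, 20: r20}, alpha=0.05)` would return whichever checkpoint's validation returns have the higher CVaR at the chosen (hypothetical) risk level.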
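
The Experiment Setup row also pins down CeSoR's training hyperparameters. The sketch below collects them as constants and shows one plausible reading of the soft-risk annealing and the IS-weight clipping; the linear schedule and all identifiers are assumptions for illustration, not the authors' Algorithm 1.

```python
import numpy as np

# Values quoted in the Experiment Setup row; names are illustrative.
N_EPISODES = 400      # episodes per training step
NU = 0.20             # fraction of each batch drawn from the original distribution D_phi0
BETA = 0.20           # fraction of each batch used for the cross-entropy (CE) update
RHO = 0.80            # fraction of training after which the soft risk level reaches alpha
IS_CLIP = (1 / 5, 5)  # clipping range for the importance-sampling weights

def soft_risk_level(progress, alpha):
    """Anneal the effective risk level from 1 (risk-neutral) down to the target alpha
    over the first RHO fraction of training. The paper's quote only states when the
    target is reached, so the linear shape here is an assumption."""
    return max(alpha, 1.0 - (1.0 - alpha) * min(progress / RHO, 1.0))

def clip_is_weights(weights):
    """Clip importance-sampling weights to [1/5, 5] for numerical stability
    (corresponding to the clipping described for Algorithm 1, Line 9)."""
    return np.clip(np.asarray(weights, dtype=float), *IS_CLIP)
```

With these values, each batch of 400 episodes would contain roughly 80 episodes from the original environment distribution, with the remainder drawn from the cross-entropy sampler.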