Efficient Risk-Averse Reinforcement Learning

Authors: Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails. Our results and CeSoR implementation are available on GitHub. We conduct experiments in 3 different domains.
Researcher Affiliation | Collaboration | Ido Greenberg (Technion) gido@campus.technion.ac.il; Yinlam Chow (Google Research) yinlamchow@google.com; Mohammad Ghavamzadeh (Google Research) ghavamza@google.com; Shie Mannor (Technion, Nvidia Research) shie@ee.technion.ac.il
Pseudocode | Yes | Algorithm 1: CeSoR
Open Source Code | Yes | Our results and CeSoR implementation are available on GitHub. The stand-alone cross entropy module is available on PyPI.
Open Datasets | No | We conduct experiments in 3 different domains. We implement CeSoR on top of a standard CVaR-PG method, which is also used as a risk-averse baseline for comparison. Specifically, we use the standard GCVaR [Tamar et al., 2015b], which guarantees convenient convergence properties (see Appendix C) and is simple to implement and analyze. We also use the standard policy gradient (PG) as a risk-neutral baseline.
Dataset Splits | No | Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR).
Hardware Specification | Yes | In each of the 3 domains, the experiments required a running time of a few hours on an Ubuntu machine with eight i9-10900X CPU cores.
Software Dependencies | No | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014].
Experiment Setup | Yes | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014], with a learning rate selected manually per benchmark and N = 400 episodes per training step. Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). For CeSoR, unless specified otherwise, ν = 20% of the trajectories per batch are drawn from the original distribution Dϕ0; β = 20% are used for the CE update; and the soft risk level reaches α after ρ = 80% of the training. As mentioned in Section 4, for numerical stability, we also clip the IS weights (Algorithm 1, Line 9) to the range [1/5, 5].
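To make the quoted experiment setup more concrete, below is a minimal sketch (in plain NumPy, not the authors' released CeSoR code) of two pieces it describes: a soft-risk schedule whose effective risk level reaches the target α after a fraction ρ = 80% of training, and a CVaR-PG-style batch weighting that keeps only the worst fraction of trajectories and clips the importance-sampling weights to [1/5, 5]. The linear annealing shape, the function names, and the omission of the GCVaR baseline term are assumptions made purely for illustration.

```python
import numpy as np

def soft_risk_level(step, total_steps, alpha, rho=0.8):
    """Effective risk level for the current training step.

    Starts risk-neutral (level 1.0, i.e. all trajectories are used) and
    anneals down to the target CVaR level `alpha`, which is reached after
    a fraction `rho` of training. The linear shape is an assumption; the
    quoted setup only states when the target is reached.
    """
    progress = min(step / (rho * total_steps), 1.0)
    return 1.0 - progress * (1.0 - alpha)

def cvar_pg_batch_weights(returns, is_weights, alpha_eff, clip=(1/5, 5.0)):
    """Per-trajectory sample weights for one CVaR-PG-style update.

    Only the worst `alpha_eff` fraction of the batch (by return) gets a
    non-zero weight, and the importance-sampling weights are clipped for
    numerical stability, as in the quoted setup. The (return - quantile)
    baseline term of the full GCVaR estimator is omitted for brevity.
    """
    returns = np.asarray(returns, dtype=float)
    w = np.clip(np.asarray(is_weights, dtype=float), *clip)
    threshold = np.quantile(returns, alpha_eff)   # empirical alpha_eff-quantile
    in_tail = returns <= threshold                # worst alpha_eff fraction
    weights = w * in_tail
    return weights / max(weights.sum(), 1e-8)     # normalize over the batch

# Illustrative usage with the quoted batch size of N = 400 episodes per step.
rng = np.random.default_rng(0)
returns = rng.normal(size=400)                    # stand-in episode returns
is_weights = rng.uniform(0.5, 2.0, size=400)      # stand-in IS weights
alpha_eff = soft_risk_level(step=50, total_steps=100, alpha=0.05)
weights = cvar_pg_batch_weights(returns, is_weights, alpha_eff)
```

The remaining quoted hyperparameters (ν = 20% of episodes drawn from the original distribution Dϕ0 and β = 20% of trajectories used for the cross-entropy update) belong to the episode-sampling loop around this weighting step and are not shown in the sketch.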