Efficient Risk-Averse Reinforcement Learning
Authors: Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails. Our results and CeSoR implementation are available on GitHub. We conduct experiments in 3 different domains. |
| Researcher Affiliation | Collaboration | Ido Greenberg (Technion, gido@campus.technion.ac.il); Yinlam Chow (Google Research, yinlamchow@google.com); Mohammad Ghavamzadeh (Google Research, ghavamza@google.com); Shie Mannor (Technion and Nvidia Research, shie@ee.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1: CeSoR |
| Open Source Code | Yes | Our results and CeSoR implementation are available on GitHub. The stand-alone cross-entropy module is available on PyPI. |
| Open Datasets | No | We conduct experiments in 3 different domains. We implement CeSoR on top of a standard CVaR-PG method, which is also used as a risk-averse baseline for comparison. Specifically, we use the standard GCVaR [Tamar et al., 2015b], which guarantees convenient convergence properties (see Appendix C) and is simple to implement and analyze. We also use the standard policy gradient (PG) as a risk-neutral baseline. |
| Dataset Splits | No | Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). |
| Hardware Specification | Yes | In each of the 3 domains, the experiments required a running time of a few hours on an Ubuntu machine with eight i9-10900X CPU cores. |
| Software Dependencies | No | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014]. |
| Experiment Setup | Yes | In all the experiments, all agents are trained using Adam [Diederik P. Kingma, 2014], with a learning rate selected manually per benchmark and N = 400 episodes per training step. Every 10 steps we run validation episodes, and we choose the final policy according to the best validation score (best mean for PG, best CVaR for GCVaR and CeSoR). For CeSoR, unless specified otherwise, ν = 20% of the trajectories per batch are drawn from the original distribution Dϕ₀; β = 20% are used for the CE update; and the soft risk level reaches α after ρ = 80% of the training. As mentioned in Section 4, for numerical stability, we also clip the IS weights (Algorithm 1, Line 9) to the range [1/5, 5]. |
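
The Dataset Splits and Experiment Setup rows quote the paper's model-selection protocol: validation episodes every 10 training steps, with the final policy chosen by best mean return for PG and best CVaR for GCVaR and CeSoR. The snippet below is a minimal sketch of that selection rule, assuming a lower-tail CVaR over validation-episode returns; the function names and the placeholder risk level `alpha` are illustrative, not taken from the released code.

```python
import numpy as np

def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns (lower-tail CVaR)."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()

def select_final_policy(validation_returns, alpha, risk_averse=True):
    """validation_returns: dict mapping checkpoint -> array of validation-episode returns.
    Risk-neutral PG keeps the checkpoint with the best mean; GCVaR/CeSoR keep the best CVaR."""
    score = (lambda r: cvar(r, alpha)) if risk_averse else np.mean
    return max(validation_returns, key=lambda ckpt: score(validation_returns[ckpt]))
```

For example, `select_final_policy({10: r10, 20: r20}, alpha=0.05)` would return whichever checkpoint's validation returns have the higher CVaR at the chosen (hypothetical) risk level.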
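
The Experiment Setup row also pins down CeSoR's training hyperparameters. The sketch below collects them as constants and shows one plausible reading of the soft-risk annealing and the IS-weight clipping; the linear schedule and all identifiers are assumptions for illustration, not the authors' Algorithm 1.

```python
import numpy as np

# Values quoted in the Experiment Setup row; names are illustrative.
N_EPISODES = 400      # episodes per training step
NU = 0.20             # fraction of each batch drawn from the original distribution D_phi0
BETA = 0.20           # fraction of each batch used for the cross-entropy (CE) update
RHO = 0.80            # fraction of training after which the soft risk level reaches alpha
IS_CLIP = (1 / 5, 5)  # clipping range for the importance-sampling weights

def soft_risk_level(progress, alpha):
    """Anneal the effective risk level from 1 (risk-neutral) down to the target alpha
    over the first RHO fraction of training. The paper's quote only states when the
    target is reached, so the linear shape here is an assumption."""
    return max(alpha, 1.0 - (1.0 - alpha) * min(progress / RHO, 1.0))

def clip_is_weights(weights):
    """Clip importance-sampling weights to [1/5, 5] for numerical stability
    (corresponding to the clipping described for Algorithm 1, Line 9)."""
    return np.clip(np.asarray(weights, dtype=float), *IS_CLIP)
```

With these values, each batch of 400 episodes would contain roughly 80 episodes from the original environment distribution, with the remainder drawn from the cross-entropy sampler.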