Risk-Averse Offline Reinforcement Learning

Authors: Núria Armengol Urpí, Sebastian Curi, Andreas Krause

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, the assessed result, and the supporting excerpt or explanation (LLM response).
Research Type: Experimental. "We show that O-RAAC learns policies with higher risk-averse performance than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that in the presence of natural distribution-shifts, O-RAAC learns policies with good average performance. ... In this section, we test the performance of O-RAAC using D = CVaR_{α=0.1} as risk distortion."
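
For context on the CVaR_{α=0.1} criterion in that excerpt, here is a minimal sketch (not from the paper) of estimating CVaR at level α from sampled returns; the function name and interface are illustrative:

    import numpy as np

    def cvar(returns, alpha=0.1):
        """Empirical CVaR_alpha: mean of the worst alpha-fraction of returns."""
        returns = np.asarray(returns, dtype=float)
        var = np.quantile(returns, alpha)      # value-at-risk at level alpha
        return returns[returns <= var].mean()  # average of the lower tail
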
Researcher Affiliation: Academia. "Núria Armengol Urpí, Dept. of Computer Science, ETH Zurich, narmengolurpi@gmail.com; Sebastian Curi, Dept. of Computer Science, ETH Zurich, scuri@inf.ethz.ch; Andreas Krause, Dept. of Computer Science, ETH Zurich, krausea@ethz.ch"
Pseudocode: Yes. "Algorithm 1: Offline Risk-Averse Actor Critic (O-RAAC)."
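
Below is a hedged sketch of the actor structure Algorithm 1 is built around as the paper describes it: an imitation-learned action plus a scaled, bounded perturbation trained against a distributional critic. Module names, layer sizes, and the interface are assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn

    class ORAACActor(nn.Module):
        """Actor = imitation action plus a bounded, scaled perturbation."""

        def __init__(self, imitation_model, state_dim, action_dim,
                     lam=0.25, max_action=1.0):
            super().__init__()
            self.imitation_model = imitation_model  # e.g. a VAE fit to the offline data
            self.lam = lam                          # perturbation scale (lambda in the paper)
            self.max_action = max_action
            self.perturbation = nn.Sequential(      # layer sizes are assumptions
                nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, action_dim), nn.Tanh(),
            )

        def forward(self, state):
            a_im = self.imitation_model(state)      # action from the imitation component
            xi = self.perturbation(torch.cat([state, a_im], dim=-1))
            a = a_im + self.lam * self.max_action * xi
            return a.clamp(-self.max_action, self.max_action)
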
Open Source Code: Yes. "Our implementation is freely available at Github: https://github.com/nuria95/O-RAAC."
Open Datasets: Yes. "We test the algorithm on a variety of continuous control benchmark tasks on the data provided in the D4RL dataset (Fu et al., 2020)."
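
The D4RL data referenced above is typically loaded through the d4rl package; a minimal sketch, assuming gym and d4rl are installed (the environment name is one example task):

    import gym
    import d4rl  # importing d4rl registers the offline environments

    env = gym.make("halfcheetah-medium-v0")  # one of the tasks used in the paper
    data = d4rl.qlearning_dataset(env)       # observations, actions, rewards, next_observations, terminals
    print(data["observations"].shape, data["rewards"].shape)
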
Dataset Splits: Yes. "Hence, to estimate the returns, we use the state-action distribution in the data set and split it into chunks of 200 time steps for the Half-Cheetah and 500 time steps for the Walker2D and the Hopper. We then compute the return of every chunk by sampling a realization from its stochastic reward function. Finally, we bootstrap the resulting chunks into 10 datasets by sampling uniformly at random with replacement and estimate the mean and CVaR_{0.1} of the returns in each batch."
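
A rough sketch of the bootstrap evaluation that excerpt describes, assuming per-chunk returns have already been computed; the helper name and interface are placeholders:

    import numpy as np

    def bootstrap_mean_cvar(chunk_returns, n_boot=10, alpha=0.1, seed=0):
        """chunk_returns: one return per 200/500-step chunk of the dataset."""
        rng = np.random.default_rng(seed)
        chunk_returns = np.asarray(chunk_returns, dtype=float)
        means, cvars = [], []
        for _ in range(n_boot):
            batch = rng.choice(chunk_returns, size=len(chunk_returns), replace=True)
            var = np.quantile(batch, alpha)
            means.append(batch.mean())
            cvars.append(batch[batch <= var].mean())  # CVaR_0.1 of the resampled batch
        return np.array(means), np.array(cvars)
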
Hardware Specification: No. The paper does not provide specific details on the hardware used, such as CPU or GPU models, or memory specifications.
Software Dependencies: No. The paper mentions using Adam for optimization but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages.
Experiment Setup: Yes. "All the network parameters are updated using Adam (Kingma & Ba, 2015) with learning rates η = 0.001 for the critic and the VAE, and η = 0.0001 for the actor model, as in Fujimoto et al. (2019). The target networks for the critic and the perturbation models are updated softly with µ = 0.005. For the critic loss (4) we use N = N′ = 32 quantile samples, whereas to approximate the CVaR to compute the actor loss (5) and (7) we use 8 samples from the uniform distribution between [0, 0.1]. ... For all MuJoCo experiments, the λ parameter which modulates the action perturbation level was experimentally set to 0.25, except for the HalfCheetah-medium experiment for which it was set to 0.5."
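
A hedged PyTorch-style sketch of the reported optimizer and update settings; the placeholder networks and the soft-update helper are illustrative, not the authors' code:

    import torch
    import torch.nn as nn

    # Placeholder networks; the real critic/VAE/actor follow the paper's architectures.
    critic, vae, actor = nn.Linear(8, 1), nn.Linear(8, 8), nn.Linear(8, 2)
    target_critic = nn.Linear(8, 1)

    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # eta = 0.001 (critic, VAE)
    vae_opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # eta = 0.0001 (actor)

    TAU = 0.005         # soft target-network update coefficient
    N_QUANTILES = 32    # quantile samples for the critic loss
    N_CVAR_TAUS = 8     # taus drawn uniformly from [0, 0.1] to approximate CVaR
    LAM = 0.25          # action-perturbation scale (0.5 for HalfCheetah-medium)

    def soft_update(target, source, tau=TAU):
        """Polyak averaging of target-network parameters."""
        with torch.no_grad():
            for tp, sp in zip(target.parameters(), source.parameters()):
                tp.mul_(1.0 - tau).add_(tau * sp)

    soft_update(target_critic, critic)
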