Risk-Averse Offline Reinforcement Learning
Authors: Núria Armengol Urpí, Sebastian Curi, Andreas Krause
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that O-RAAC learns policies with higher risk-averse performance than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that in the presence of natural distribution-shifts, O-RAAC learns policies with good average performance. ... In this section, we test the performance of O-RAAC using D = CVaRα=0.1 as risk distortion. |
| Researcher Affiliation | Academia | Núria Armengol Urpí, Dept. of Computer Science, ETH Zurich, narmengolurpi@gmail.com; Sebastian Curi, Dept. of Computer Science, ETH Zurich, scuri@inf.ethz.ch; Andreas Krause, Dept. of Computer Science, ETH Zurich, krausea@ethz.ch |
| Pseudocode | Yes | Algorithm 1: Offline Risk-Averse Actor Critic (O-RAAC). |
| Open Source Code | Yes | Our implementation is freely available at Github: https://github.com/nuria95/O-RAAC. |
| Open Datasets | Yes | We test the algorithm on a variety of continuous control benchmark tasks on the data provided in the D4RL dataset (Fu et al., 2020). |
| Dataset Splits | Yes | Hence, to estimate the returns, we use the state-action distribution in the data set and split it into chunks of 200 time steps for the Half-Cheetah and 500 time steps for the Walker2D and the Hopper. We then compute the return of every chunk by sampling a realization from its stochastic reward function. Finally, we bootstrap the resulting chunks into 10 datasets by sampling uniformly at random with replacement and estimate the mean and CVaR0.1 of the returns in each batch. (A sketch of this bootstrap evaluation appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using Adam for optimization but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | All the network parameters are updated using Adam (Kingma & Ba, 2015) with learning rates η = 0.001 for the critic and the VAE, and η = 0.0001 for the actor model, as in Fujimoto et al. (2019). The target networks for the critic and the perturbation models are updated softly with µ = 0.005. For the critic loss (4) we use N = N′ = 32 quantile samples, whereas to approximate the CVaR to compute the actor loss (5), (7) we use 8 samples from the uniform distribution between [0, 0.1]. ... For all MuJoCo experiments, the λ parameter which modulates the action perturbation level was experimentally set to 0.25, except for the Half-Cheetah-medium experiment for which it was set to 0.5. (A configuration sketch built from these values appears after the table.) |
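
The bootstrap evaluation quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' evaluation code: the chunk lengths, the 10 bootstrap resamples, and the CVaR0.1 level follow the quoted description, while the function names, the NumPy implementation, and the use of plain reward sums as per-chunk returns are assumptions made for illustration.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of returns."""
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()

def bootstrap_mean_and_cvar(rewards, chunk_len=200, n_boot=10, alpha=0.1, seed=0):
    """Split a reward trajectory into fixed-length chunks (200 steps for Half-Cheetah,
    500 for Walker2D/Hopper), then bootstrap the per-chunk returns into n_boot
    resampled datasets and estimate the mean and CVaR_alpha in each."""
    rng = np.random.default_rng(seed)
    n_chunks = len(rewards) // chunk_len
    chunk_returns = np.array([rewards[i * chunk_len:(i + 1) * chunk_len].sum()
                              for i in range(n_chunks)])
    means, cvars = [], []
    for _ in range(n_boot):
        # Sample chunks uniformly at random with replacement, as in the quoted protocol.
        resample = rng.choice(chunk_returns, size=n_chunks, replace=True)
        means.append(resample.mean())
        cvars.append(empirical_cvar(resample, alpha))
    return float(np.mean(means)), float(np.mean(cvars))
```

With alpha = 0.1, `empirical_cvar` averages the lowest 10% of bootstrapped returns, so the metric rewards a policy for avoiding rare low-return chunks rather than only for a high mean.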
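
The hyperparameters quoted under Experiment Setup also indicate how the CVaR0.1 objective is approximated from a distributional critic: quantile levels τ are drawn uniformly from [0, 0.1] and the corresponding critic quantiles are averaged. The sketch below encodes the quoted values; the critic signature, the module names, and the decomposition of the action into an imitation term plus a λ-scaled perturbation are placeholders inferred from the quoted text, not the authors' API.

```python
import torch
from torch.optim import Adam

ALPHA = 0.1            # CVaR level used in the experiments
N_TAU_SAMPLES = 8      # samples from U[0, 0.1] used to approximate the CVaR (quoted above)
LAMBDA = 0.25          # action-perturbation scale (0.5 for Half-Cheetah-medium)
SOFT_UPDATE = 0.005    # target-network soft-update coefficient

def cvar_from_quantile_critic(critic, state, action, alpha=ALPHA, n_samples=N_TAU_SAMPLES):
    """Monte-Carlo estimate of CVaR_alpha of Z(s, a): average the critic's quantile
    estimates at levels tau drawn uniformly from [0, alpha] (distributional critic assumed)."""
    taus = alpha * torch.rand(state.shape[0], n_samples, device=state.device)
    quantiles = critic(state, action, taus)   # placeholder signature: one quantile value per tau
    return quantiles.mean(dim=1)

# Optimizers mirroring the quoted learning rates (critic, vae, perturbation are placeholder modules):
# critic_opt = Adam(critic.parameters(), lr=1e-3)
# vae_opt = Adam(vae.parameters(), lr=1e-3)
# actor_opt = Adam(perturbation.parameters(), lr=1e-4)
#
# Actor step (maximize the CVaR estimate, i.e. minimize its negative) with the perturbed imitation action:
# action = vae.decode(state) + LAMBDA * perturbation(state, vae.decode(state))
# actor_loss = -cvar_from_quantile_critic(critic, state, action).mean()
#
# Target networks for the critic and the perturbation model are soft-updated with SOFT_UPDATE.
```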