Risk-Averse Offline Reinforcement Learning
Authors: Núria Armengol Urpí, Sebastian Curi, Andreas Krause
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that O-RAAC learns policies with higher risk-averse performance than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that in the presence of natural distribution-shifts, O-RAAC learns policies with good average performance. ... In this section, we test the performance of O-RAAC using D = CVaRα=0.1 as risk distortion. |
| Researcher Affiliation | Academia | Núria Armengol Urpí, Dept. of Computer Science, ETH Zurich, narmengolurpi@gmail.com; Sebastian Curi, Dept. of Computer Science, ETH Zurich, scuri@inf.ethz.ch; Andreas Krause, Dept. of Computer Science, ETH Zurich, krausea@ethz.ch |
| Pseudocode | Yes | Algorithm 1: Offline Risk-Averse Actor Critic (O-RAAC). |
| Open Source Code | Yes | Our implementation is freely available at Github: https://github.com/nuria95/O-RAAC. |
| Open Datasets | Yes | We test the algorithm on a variety of continuous control benchmark tasks on the data provided in the D4RL dataset (Fu et al., 2020). |
| Dataset Splits | Yes | Hence, to estimate the returns, we use the state-action distribution in the data set and split it into chunks of 200 time steps for the Half-Cheetah and 500 time steps for the Walker2D and the Hopper. We then compute the return of every chunk by sampling a realization from its stochastic reward function. Finally, we bootstrap the resulting chunks into 10 datasets by sampling uniformly at random with replacement and estimate the mean and CVaR0.1 of the returns in each batch. (A sketch of this bootstrap evaluation appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using Adam for optimization but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | All the network parameters are updated using Adam (Kingma & Ba, 2015) with learning rates η = 0.001 for the critic and the VAE, and η = 0.0001 for the actor model, as in Fujimoto et al. (2019). The target networks for the critic and the perturbation models are updated softly with µ = 0.005. For the critic loss (4) we use N = N′ = 32 quantile samples, whereas to approximate the CVaR to compute the actor loss (5), (7) we use 8 samples from the uniform distribution between [0, 0.1]. ... For all MuJoCo experiments, the λ parameter which modulates the action perturbation level was experimentally set to 0.25, except for the Half-Cheetah-medium experiment for which it was set to 0.5. (A configuration sketch built from these values appears after the table.) |
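
The bootstrap evaluation quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' evaluation code: the chunk lengths, the 10 bootstrap resamples, and the CVaR0.1 level follow the quoted description, while the function names, the NumPy implementation, and the use of plain reward sums as per-chunk returns are assumptions made for illustration.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of returns."""
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()

def bootstrap_mean_and_cvar(rewards, chunk_len=200, n_boot=10, alpha=0.1, seed=0):
    """Split a reward trajectory into fixed-length chunks (200 steps for Half-Cheetah,
    500 for Walker2D/Hopper), then bootstrap the per-chunk returns into n_boot
    resampled datasets and estimate the mean and CVaR_alpha in each."""
    rng = np.random.default_rng(seed)
    n_chunks = len(rewards) // chunk_len
    chunk_returns = np.array([rewards[i * chunk_len:(i + 1) * chunk_len].sum()
                              for i in range(n_chunks)])
    means, cvars = [], []
    for _ in range(n_boot):
        # Sample chunks uniformly at random with replacement, as in the quoted protocol.
        resample = rng.choice(chunk_returns, size=n_chunks, replace=True)
        means.append(resample.mean())
        cvars.append(empirical_cvar(resample, alpha))
    return float(np.mean(means)), float(np.mean(cvars))
```

With alpha = 0.1, `empirical_cvar` averages the lowest 10% of bootstrapped returns, so the metric rewards a policy for avoiding rare low-return chunks rather than only for a high mean.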
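
The hyperparameters quoted under Experiment Setup also indicate how the CVaR0.1 objective is approximated from a distributional critic: quantile levels τ are drawn uniformly from [0, 0.1] and the corresponding critic quantiles are averaged. The sketch below encodes the quoted values; the critic signature, the module names, and the decomposition of the action into an imitation term plus a λ-scaled perturbation are placeholders inferred from the quoted text, not the authors' API.

```python
import torch
from torch.optim import Adam

ALPHA = 0.1            # CVaR level used in the experiments
N_TAU_SAMPLES = 8      # samples from U[0, 0.1] used to approximate the CVaR (quoted above)
LAMBDA = 0.25          # action-perturbation scale (0.5 for Half-Cheetah-medium)
SOFT_UPDATE = 0.005    # target-network soft-update coefficient

def cvar_from_quantile_critic(critic, state, action, alpha=ALPHA, n_samples=N_TAU_SAMPLES):
    """Monte-Carlo estimate of CVaR_alpha of Z(s, a): average the critic's quantile
    estimates at levels tau drawn uniformly from [0, alpha] (distributional critic assumed)."""
    taus = alpha * torch.rand(state.shape[0], n_samples, device=state.device)
    quantiles = critic(state, action, taus)   # placeholder signature: one quantile value per tau
    return quantiles.mean(dim=1)

# Optimizers mirroring the quoted learning rates (critic, vae, perturbation are placeholder modules):
# critic_opt = Adam(critic.parameters(), lr=1e-3)
# vae_opt = Adam(vae.parameters(), lr=1e-3)
# actor_opt = Adam(perturbation.parameters(), lr=1e-4)
#
# Actor step (maximize the CVaR estimate, i.e. minimize its negative) with the perturbed imitation action:
# action = vae.decode(state) + LAMBDA * perturbation(state, vae.decode(state))
# actor_loss = -cvar_from_quantile_critic(critic, state, action).mean()
#
# Target networks for the critic and the perturbation model are soft-updated with SOFT_UPDATE.
```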