Conservative Offline Distributional Reinforcement Learning

Authors: Yecheng Ma, Dinesh Jayaraman, Osbert Bastani

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance.
Researcher Affiliation | Academia | Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani, University of Pennsylvania, {jasonyma, dineshj, obastani}@seas.upenn.edu
Pseudocode | Yes | We provide the full CODAC pseudocode in Algorithm 1 of Appendix B.
Open Source Code | Yes | Code is available at: https://github.com/JasonMa2016/CODAC
Open Datasets | Yes | Next, we consider stochastic D4RL [43]. The original D4RL benchmark [10] consists of datasets collected by SAC agents of varying performance (Mixed, Medium, and Expert) on the Hopper, Walker2d, and HalfCheetah MuJoCo environments [41];
Dataset Splits | No | The paper mentions using an 'offline dataset' and evaluating on 'test episodes', but it does not provide specific percentages or counts for training, validation, and test splits within the main text.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions software like 'distributional soft actor-critic (DSAC)', 'SAC', and 'MuJoCo environments', but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We provide additional details (e.g., environment descriptions, hyperparameters, and additional results) in Appendix C. ... For all experiments in this section, we used a batch size of 256, a discount factor of 0.99, a quantile regularization magnitude of 0.2, a target update rate of 0.005, and 32 quantiles in each of the networks. The neural network architecture is a 3-layer MLP with 256 hidden units and ReLU activations. We used a temperature of 0.1 for the Risky Point Mass and 1.0 for the Risky Ant, and a conservative penalty (lambda) of 5.0 for the Risky Point Mass and 10.0 for the Risky Ant.
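
The Open Datasets row above refers to the D4RL MuJoCo datasets. As a minimal sketch of how those underlying datasets are typically accessed (the paper itself evaluates a stochastic variant of D4RL, and the dataset name below is only an illustrative example, not one quoted from the paper):

```python
# Minimal sketch of loading a standard D4RL MuJoCo dataset.
# Note: the paper uses a *stochastic* variant of D4RL, so this only
# illustrates how the base benchmark data is commonly accessed.
import gym
import d4rl  # importing d4rl registers the benchmark environments with gym

# 'hopper-medium-v2' is an illustrative dataset name, not taken from the paper.
env = gym.make("hopper-medium-v2")

# Flattened (s, a, r, s', done) transitions suitable for offline RL training.
dataset = d4rl.qlearning_dataset(env)

print(dataset["observations"].shape)  # (N, obs_dim)
print(dataset["actions"].shape)       # (N, act_dim)
print(dataset["rewards"].shape)       # (N,)
```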
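
The Experiment Setup row quotes concrete hyperparameters and a network architecture (3-layer MLP, 256 hidden units, ReLU, 32 quantiles). The sketch below simply collects those reported values into a configuration dictionary and a matching quantile critic; the dictionary keys and the `QuantileCritic` class are illustrative names and not identifiers from the official CODAC codebase.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered in one place.
# Names below are illustrative, not taken from the CODAC repository.
import torch
import torch.nn as nn

config = {
    "batch_size": 256,
    "discount": 0.99,
    "quantile_regularization": 0.2,   # quantile regularization magnitude
    "target_update_rate": 0.005,      # soft (Polyak) target-network update rate
    "num_quantiles": 32,
    "hidden_units": 256,
    # Environment-specific values reported in the paper:
    "temperature": {"risky_point_mass": 0.1, "risky_ant": 1.0},
    "conservative_penalty": {"risky_point_mass": 5.0, "risky_ant": 10.0},
}

class QuantileCritic(nn.Module):
    """3-layer MLP with ReLU activations, one output per return-distribution quantile."""

    def __init__(self, obs_dim: int, act_dim: int, cfg: dict = config):
        super().__init__()
        h, n_q = cfg["hidden_units"], cfg["num_quantiles"]
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, h), nn.ReLU(),
            nn.Linear(h, h), nn.ReLU(),
            nn.Linear(h, n_q),  # 32 quantile estimates of the return
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Output shape: (batch, num_quantiles)
        return self.net(torch.cat([obs, act], dim=-1))
```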