Conservative Offline Distributional Reinforcement Learning
Authors: Yecheng Ma, Dinesh Jayaraman, Osbert Bastani
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance. |
| Researcher Affiliation | Academia | Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani University of Pennsylvania {jasonyma, dineshj, obastani}@seas.upenn.edu |
| Pseudocode | Yes | We provide the full CODAC pseudocode in Algorithm 1 of Appendix B. |
| Open Source Code | Yes | Code is available at: https://github.com/JasonMa2016/CODAC |
| Open Datasets | Yes | Next, we consider stochastic D4RL [43]. The original D4RL benchmark [10] consists of datasets collected by SAC agents of varying performance (Mixed, Medium, and Expert) on the Hopper, Walker2d, and HalfCheetah MuJoCo environments [41]; |
| Dataset Splits | No | The paper mentions using an 'offline dataset' and evaluating on 'test episodes', but it does not provide specific percentages or counts for training, validation, and test splits within the main text. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'distributional soft actor-critic (DSAC)', 'SAC', and 'MuJoCo environments', but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We provide additional details (e.g., environment descriptions, hyperparameters, and additional results) in Appendix C. ... For all experiments in this section, we used a batch size of 256, a discount factor of 0.99, a quantile regularization magnitude of 0.2, a target update rate of 0.005, and 32 quantiles in each of the networks. The neural network architecture is a 3-layer MLP with 256 hidden units and ReLU activations. We used a temperature of 0.1 for the Risky Point Mass and 1.0 for the Risky Ant, and a conservative penalty (lambda) of 5.0 for the Risky Point Mass and 10.0 for the Risky Ant. |
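
To make the Open Datasets row concrete, below is a minimal sketch of how the D4RL MuJoCo datasets referenced there are typically loaded with the public `d4rl` package; the specific dataset id and version suffix are illustrative and not taken from the paper.

```python
import gym
import d4rl  # importing d4rl registers the D4RL datasets/environments with gym

# Illustrative dataset id; the paper evaluates Hopper, Walker2d, and HalfCheetah
# with Mixed/Medium/Expert data collected by SAC agents (the exact version
# suffix may differ from the one used by the authors).
env = gym.make("hopper-medium-v2")

# Returns a dict of aligned transition arrays:
# 'observations', 'actions', 'next_observations', 'rewards', 'terminals'
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```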
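
The Experiment Setup row lists concrete hyperparameters; the sketch below collects them into a config dictionary and shows an illustrative quantile critic matching the stated architecture (3-layer MLP, 256 hidden units, ReLU activations, 32 quantile outputs). All variable names, dictionary keys, and the PyTorch implementation are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the "Experiment Setup" row; per-environment
# values are given where the paper distinguishes them. Key names are our own.
CONFIG = {
    "batch_size": 256,
    "discount": 0.99,
    "quantile_regularization": 0.2,
    "target_update_rate": 0.005,  # Polyak averaging coefficient for target networks
    "num_quantiles": 32,
    "temperature": {"risky_point_mass": 0.1, "risky_ant": 1.0},
    "conservative_penalty": {"risky_point_mass": 5.0, "risky_ant": 10.0},  # lambda
}


class QuantileCritic(nn.Module):
    """Illustrative critic: an MLP with 256 hidden units and ReLU activations
    that outputs one value per quantile. We interpret "3-layer" as three linear
    layers (two hidden layers plus the quantile output head); adjust if the
    authors meant three hidden layers."""

    def __init__(self, obs_dim, act_dim, hidden=256, num_quantiles=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_quantiles),
        )

    def forward(self, obs, act):
        # Returns a (batch, num_quantiles) tensor of return quantiles.
        return self.net(torch.cat([obs, act], dim=-1))


# Usage example with dummy dimensions.
critic = QuantileCritic(obs_dim=11, act_dim=3, num_quantiles=CONFIG["num_quantiles"])
quantiles = critic(torch.randn(CONFIG["batch_size"], 11), torch.randn(CONFIG["batch_size"], 3))
print(quantiles.shape)  # torch.Size([256, 32])
```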