Conservative Offline Distributional Reinforcement Learning
Authors: Yecheng Ma, Dinesh Jayaraman, Osbert Bastani
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance. |
| Researcher Affiliation | Academia | Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani University of Pennsylvania {jasonyma, dineshj, obastani}@seas.upenn.edu |
| Pseudocode | Yes | We provide the full CODAC pseudocode in Algorithm 1 of Appendix B. |
| Open Source Code | Yes | Code is available at: https://github.com/JasonMa2016/CODAC |
| Open Datasets | Yes | Next, we consider stochastic D4RL [43]. The original D4RL benchmark [10] consists of datasets collected by SAC agents of varying performance (Mixed, Medium, and Expert) on the Hopper, Walker2d, and HalfCheetah MuJoCo environments [41]; |
| Dataset Splits | No | The paper mentions using an 'offline dataset' and evaluating on 'test episodes', but it does not provide specific percentages or counts for training, validation, and test splits within the main text. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'distributional soft actor-critic (DSAC)', 'SAC', and 'MuJoCo environments', but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We provide additional details (e.g., environment descriptions, hyperparameters, and additional results) in Appendix C. ... For all experiments in this section, we used a batch size of 256, a discount factor of 0.99, a quantile regularization magnitude of 0.2, a target update rate of 0.005, and 32 quantiles in each of the networks. The neural network architecture is a 3-layer MLP with 256 hidden units and ReLU activations. We used a temperature of 0.1 for the Risky Point Mass and 1.0 for the Risky Ant, and a conservative penalty (lambda) of 5.0 for the Risky Point Mass and 10.0 for the Risky Ant. |
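
To make the Open Datasets row concrete, below is a minimal sketch of how the D4RL MuJoCo datasets referenced there are typically loaded with the public `d4rl` package; the specific dataset id and version suffix are illustrative and not taken from the paper.

```python
import gym
import d4rl  # importing d4rl registers the D4RL datasets/environments with gym

# Illustrative dataset id; the paper evaluates Hopper, Walker2d, and HalfCheetah
# with Mixed/Medium/Expert data collected by SAC agents (the exact version
# suffix may differ from the one used by the authors).
env = gym.make("hopper-medium-v2")

# Returns a dict of aligned transition arrays:
# 'observations', 'actions', 'next_observations', 'rewards', 'terminals'
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```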
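
The Experiment Setup row lists concrete hyperparameters; the sketch below collects them into a config dictionary and shows an illustrative quantile critic matching the stated architecture (3-layer MLP, 256 hidden units, ReLU activations, 32 quantile outputs). All variable names, dictionary keys, and the PyTorch implementation are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the "Experiment Setup" row; per-environment
# values are given where the paper distinguishes them. Key names are our own.
CONFIG = {
    "batch_size": 256,
    "discount": 0.99,
    "quantile_regularization": 0.2,
    "target_update_rate": 0.005,  # Polyak averaging coefficient for target networks
    "num_quantiles": 32,
    "temperature": {"risky_point_mass": 0.1, "risky_ant": 1.0},
    "conservative_penalty": {"risky_point_mass": 5.0, "risky_ant": 10.0},  # lambda
}


class QuantileCritic(nn.Module):
    """Illustrative critic: an MLP with 256 hidden units and ReLU activations
    that outputs one value per quantile. We interpret "3-layer" as three linear
    layers (two hidden layers plus the quantile output head); adjust if the
    authors meant three hidden layers."""

    def __init__(self, obs_dim, act_dim, hidden=256, num_quantiles=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_quantiles),
        )

    def forward(self, obs, act):
        # Returns a (batch, num_quantiles) tensor of return quantiles.
        return self.net(torch.cat([obs, act], dim=-1))


# Usage example with dummy dimensions.
critic = QuantileCritic(obs_dim=11, act_dim=3, num_quantiles=CONFIG["num_quantiles"])
quantiles = critic(torch.randn(CONFIG["batch_size"], 11), torch.randn(CONFIG["batch_size"], 3))
print(quantiles.shape)  # torch.Size([256, 32])
```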