Continuous Doubly Constrained Batch Reinforcement Learning
Authors: Rasool Fakoor, Jonas W. Mueller, Kavosh Asadi, Pratik Chaudhari, Alexander J. Smola
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected." "In this section, we evaluate our CDC algorithm against existing methods on 32 tasks from the D4RL benchmark [14]. We also investigate the utility of individual CDC regularizers through ablation analyses, and demonstrate the broader applicability of our extra-overestimation penalty to off-policy evaluation in addition to batch RL." |
| Researcher Affiliation | Collaboration | Amazon Web Services; University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1 Continuous Doubly Constrained Batch RL |
| Open Source Code | No | The paper provides no access link or explicit statement about the availability of source code for its own method. It only references the code of a baseline method (CQL): "The results for CQL are taken from the official author-provided codes [https://github.com/aviralkumar2907/CQL] of [29]." |
| Open Datasets | Yes | "We compare CDC against existing batch RL methods... on 32 tasks from the D4RL benchmark [14]." Reference [14]: "J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv:2004.07219, 2020." |
| Dataset Splits | No | The paper states: "Our training/evaluation setup exactly follows existing work [14, 17, 28, 29]," but it does not provide specific details on the training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit citations to predefined splits) within its own text. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In CDC, we can simply utilize the same moderately conservative value of ν = 0.75 used by [17]... CDC is able to achieve strong performance with a small ensemble of M = 4 Q-networks (used throughout this work)... Throughout, we use η = 0 & λ = 0 to refer to this baseline framework (without our proposed penalties)... |
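
For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is a minimal illustration, not the authors' code: the field names (`nu`, `num_q_networks`, `eta`, `lam`) are hypothetical, only the values 0.75 and 4 and the η = λ = 0 baseline appear in the excerpt, and the non-zero penalty weights used by full CDC are not stated there.

```python
from dataclasses import dataclass

@dataclass
class CDCConfig:
    """Illustrative container for the CDC hyperparameters quoted above.

    Field names are hypothetical; only the default values of nu and
    num_q_networks, and the eta = lam = 0 baseline, come from the excerpt.
    """
    nu: float = 0.75         # "moderately conservative value of nu = 0.75 used by [17]"
    num_q_networks: int = 4  # "small ensemble of M = 4 Q-networks (used throughout this work)"
    eta: float = 0.0         # weight of one proposed CDC penalty; 0 disables it (baseline)
    lam: float = 0.0         # weight of the other proposed CDC penalty; 0 disables it (baseline)

# eta = lam = 0 reproduces the baseline framework without the proposed penalties;
# the non-zero values used for full CDC are not given in the excerpt above.
baseline = CDCConfig()
print(baseline)
```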