Conservative Q-Learning for Offline Reinforcement Learning
Authors: Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experimental Evaluation): 'We compare CQL to prior offline RL methods on a range of domains and dataset compositions, including continuous and discrete action spaces, state observations of varying dimensionality, and high-dimensional image inputs.' |
| Researcher Affiliation | Collaboration | Aviral Kumar (UC Berkeley), Aurick Zhou (UC Berkeley), George Tucker (Google Research, Brain Team), Sergey Levine (UC Berkeley and Google Research, Brain Team) |
| Pseudocode | Yes | Algorithm 1 Conservative Q-Learning (both variants) |
| Open Source Code | No | The paper states 'Our algorithm requires an addition of only 20 lines of code on top of standard implementations of soft actor-critic (SAC) [19] for continuous control experiments and on top of QR-DQN [8] for the discrete control.' but provides neither a link to the source code nor an explicit statement of its release (a hedged sketch of such an addition appears below the table). |
| Open Datasets | Yes | We first evaluate actor-critic CQL, using CQL(H) from Algorithm 1, on continuous control datasets from the D4RL benchmark [12]. ... using the dataset released by the authors [3]. |
| Dataset Splits | No | The paper uses standard benchmarks like D4RL but does not explicitly state the specific training/validation/test dataset splits (percentages or counts) used for their experiments within the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using standard implementations of soft actor-critic (SAC) and QR-DQN, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use default hyperparameters from SAC, except that the learning rate for the policy was chosen from {3e-5, 1e-4, 3e-4}, and is less than or equal to the Q-function, as dictated by Theorem 3.3. Elaborate details are provided in Appendix F. (A hedged configuration sketch reflecting these choices appears below the table.) |
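
The 'Open Source Code' row quotes the paper's claim that CQL adds only about 20 lines on top of a standard SAC implementation. Below is a minimal sketch of what such an addition could look like for the CQL(H) critic loss of Algorithm 1; it is not the authors' code. The helper names (`q_net`, `target_q_net`, `policy.sample`), the tensor layout, and the default `alpha` are assumptions, and the log-sum-exp term is approximated with policy samples rather than the paper's importance-sampled estimator.

```python
import torch

def cql_h_critic_loss(q_net, target_q_net, policy, batch,
                      alpha=5.0, gamma=0.99, num_sampled_actions=10):
    """Hedged sketch: CQL(H) penalty added to a SAC-style Bellman error.

    Assumptions (not from the paper's released code): q_net(s, a) returns a
    (batch,) tensor of Q-values, policy.sample(s, n) returns a
    (batch, n, act_dim) tensor of candidate actions, and `batch` holds the
    tensors s, a, r, s2, done drawn from the offline dataset.
    """
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # Standard Bellman backup (a full SAC critic would also subtract an entropy term).
    with torch.no_grad():
        a2 = policy.sample(s2, 1).squeeze(1)
        target = r + gamma * (1.0 - done) * target_q_net(s2, a2)
    bellman_error = ((q_net(s, a) - target) ** 2).mean()

    # CQL(H) regularizer: push down a log-sum-exp of Q over sampled actions,
    # push up Q-values on the actions actually present in the dataset.
    sampled = policy.sample(s, num_sampled_actions)            # (batch, n, act_dim)
    q_sampled = torch.stack(
        [q_net(s, sampled[:, i]) for i in range(num_sampled_actions)], dim=1
    )                                                          # (batch, n)
    conservative_penalty = (torch.logsumexp(q_sampled, dim=1) - q_net(s, a)).mean()

    return bellman_error + alpha * conservative_penalty
```

The conservative penalty is the only change relative to the underlying SAC critic update, which is consistent with the '20 lines of code' characterization quoted above.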
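
The 'Experiment Setup' row states that SAC defaults were kept, with the policy learning rate swept over {3e-5, 1e-4, 3e-4} and constrained to not exceed the Q-function learning rate (Theorem 3.3). A small configuration sketch encoding that constraint is shown below; the field names and the assumed Q-function learning rate of 3e-4 (a common SAC default) are illustrative, not values stated in the row.

```python
# Hedged configuration sketch for the continuous-control setup described above.
# Only the policy-lr grid and the ordering constraint come from the quoted text;
# the Q-function learning rate and the field names are assumptions.
POLICY_LR_GRID = [3e-5, 1e-4, 3e-4]

def make_config(policy_lr: float, q_lr: float = 3e-4) -> dict:
    """Build one run configuration, enforcing policy_lr <= q_lr (Theorem 3.3)."""
    if policy_lr not in POLICY_LR_GRID:
        raise ValueError(f"policy_lr must be one of {POLICY_LR_GRID}")
    if policy_lr > q_lr:
        raise ValueError("policy lr must not exceed the Q-function lr (Theorem 3.3)")
    return {"policy_lr": policy_lr, "q_lr": q_lr}  # remaining hyperparameters: SAC defaults

configs = [make_config(lr) for lr in POLICY_LR_GRID]
```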