Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Conservative Q-Learning for Offline Reinforcement Learning
Authors: Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6, Experimental Evaluation: We compare CQL to prior offline RL methods on a range of domains and dataset compositions, including continuous and discrete action spaces, state observations of varying dimensionality, and high-dimensional image inputs. |
| Researcher Affiliation | Collaboration | Aviral Kumar¹, Aurick Zhou¹, George Tucker², Sergey Levine¹,²; ¹UC Berkeley, ²Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1 Conservative Q-Learning (both variants) |
| Open Source Code | No | The paper states 'Our algorithm requires an addition of only 20 lines of code on top of standard implementations of soft actor-critic (SAC) [19] for continuous control experiments and on top of QR-DQN [8] for the discrete control.' but does not provide a concrete link to their source code or an explicit statement of its release. |
| Open Datasets | Yes | We first evaluate actor-critic CQL, using CQL(H) from Algorithm 1, on continuous control datasets from the D4RL benchmark [12]. ... using the dataset released by the authors [3]. |
| Dataset Splits | No | The paper uses standard benchmarks like D4RL but does not explicitly state the specific training/validation/test dataset splits (percentages or counts) used for their experiments within the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using standard implementations of soft actor-critic (SAC) and QR-DQN, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use default hyperparameters from SAC, except that the learning rate for the policy was chosen from {3e-5, 1e-4, 3e-4}, and is less than or equal to the Q-function, as dictated by Theorem 3.3. Elaborate details are provided in Appendix F. |
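
The Experiment Setup row describes a small policy learning-rate grid constrained to be no larger than the Q-function learning rate. A minimal sketch of enumerating that grid is given below. The Q-function learning rate of 3e-4 is an assumption (a commonly used SAC default that this summary does not state), and the names `valid_policy_lrs` and `build_configs` are hypothetical helpers, not part of any released CQL code.

```python
# Candidate policy learning rates quoted in the Experiment Setup row.
POLICY_LR_GRID = [3e-5, 1e-4, 3e-4]

# Assumption: a Q-function learning rate of 3e-4 (a commonly used SAC default).
# The summary above does not state this value.
ASSUMED_Q_LR = 3e-4


def valid_policy_lrs(policy_lrs, q_lr):
    """Return the policy learning rates that do not exceed the Q-function rate."""
    return [lr for lr in policy_lrs if lr <= q_lr]


def build_configs(policy_lrs, q_lr):
    """Build one hypothetical config dict per admissible policy learning rate."""
    return [
        {"policy_lr": lr, "q_lr": q_lr}
        for lr in valid_policy_lrs(policy_lrs, q_lr)
    ]


if __name__ == "__main__":
    for cfg in build_configs(POLICY_LR_GRID, ASSUMED_Q_LR):
        print(cfg)
```

Under the assumed Q-function rate, all three candidates satisfy the constraint; a smaller Q-function rate would prune the grid accordingly.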