Conservative Q-Learning for Offline Reinforcement Learning

Authors: Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 6 (Experimental Evaluation): "We compare CQL to prior offline RL methods on a range of domains and dataset compositions, including continuous and discrete action spaces, state observations of varying dimensionality, and high-dimensional image inputs."
Researcher Affiliation | Collaboration | Aviral Kumar¹, Aurick Zhou¹, George Tucker², Sergey Levine¹,² (¹UC Berkeley, ²Google Research, Brain Team)
Pseudocode | Yes | Algorithm 1: Conservative Q-Learning (both variants)
Open Source Code | No | The paper states 'Our algorithm requires an addition of only 20 lines of code on top of standard implementations of soft actor-critic (SAC) [19] for continuous control experiments and on top of QR-DQN [8] for the discrete control,' but it does not provide a link to the source code or an explicit statement of its release. (A hedged sketch of what such an addition might look like follows the table.)
Open Datasets | Yes | "We first evaluate actor-critic CQL, using CQL(H) from Algorithm 1, on continuous control datasets from the D4RL benchmark [12]. ... using the dataset released by the authors [3]." (A minimal D4RL loading example follows the table.)
Dataset Splits | No | The paper uses standard benchmarks like D4RL but does not explicitly state the specific training/validation/test dataset splits (percentages or counts) used for the experiments within the main text.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using standard implementations of soft actor-critic (SAC) and QR-DQN, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We use default hyperparameters from SAC, except that the learning rate for the policy was chosen from {3e-5, 1e-4, 3e-4}, and is less than or equal to the Q-function, as dictated by Theorem 3.3. Elaborate details are provided in Appendix F." (An illustrative encoding of this learning-rate constraint follows the table.)
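
The Open Source Code row quotes the authors' claim that CQL needs only about 20 extra lines on top of a standard SAC implementation, with the update given as Algorithm 1. Below is a minimal sketch of what such an extra critic-loss term could look like, assuming a PyTorch SAC setup in which `critic(states, actions)` returns Q-values and `policy.sample(states)` returns sampled actions with log-probabilities; the function name, the `cql_alpha` weight, and the omission of the importance weights over sampled actions are simplifications for illustration, not the authors' code.

```python
import torch

def cql_h_penalty(critic, states, actions, policy, num_samples=10, cql_alpha=5.0):
    """Sketch of a CQL(H)-style regularizer added on top of the usual SAC critic loss."""
    batch_size, action_dim = actions.shape

    # Q-values on actions that are not necessarily in the dataset:
    # uniform random actions plus actions drawn from the current policy.
    sampled_qs = []
    for _ in range(num_samples):
        random_actions = torch.empty(batch_size, action_dim).uniform_(-1.0, 1.0)
        policy_actions, _ = policy.sample(states)
        sampled_qs.append(critic(states, random_actions))
        sampled_qs.append(critic(states, policy_actions))

    # log-sum-exp over the sampled actions acts as a soft maximum over Q(s, a).
    logsumexp_q = torch.logsumexp(torch.stack(sampled_qs, dim=1), dim=1)

    # Q-values of the dataset (in-distribution) actions.
    q_data = critic(states, actions)

    # Push down Q on out-of-distribution actions, push up Q on dataset actions.
    return cql_alpha * (logsumexp_q - q_data).mean()
```

In a SAC-style update this penalty would simply be added to the ordinary Bellman error, e.g. `critic_loss = bellman_error + cql_h_penalty(critic, states, actions, policy)`.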
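
The Open Datasets row points to the D4RL benchmark. A minimal loading sketch, assuming the `d4rl` package and its MuJoCo dependencies are installed; `halfcheetah-medium-v0` is just one example task id from the benchmark.

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with Gym

# Any D4RL environment id works the same way; this one is only an example.
env = gym.make("halfcheetah-medium-v0")

# Dict of NumPy arrays with keys such as 'observations', 'actions',
# 'rewards', 'next_observations', and 'terminals': the static offline dataset.
dataset = d4rl.qlearning_dataset(env)

print(dataset["observations"].shape, dataset["actions"].shape)
```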
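
The Experiment Setup row quotes the policy learning-rate sweep {3e-5, 1e-4, 3e-4} and the constraint that it stay at or below the Q-function learning rate. A small illustration of how that constraint could be encoded; the 3e-4 Q-function learning rate is the usual SAC default, and the dictionary layout is an assumption rather than the authors' configuration.

```python
# Q-function learning rate: assumed here to be the standard SAC default (3e-4).
Q_FUNCTION_LR = 3e-4
POLICY_LR_CANDIDATES = [3e-5, 1e-4, 3e-4]

# Keep only the settings allowed by Theorem 3.3: policy lr <= Q-function lr.
configs = [
    {"policy_lr": policy_lr, "q_function_lr": Q_FUNCTION_LR}
    for policy_lr in POLICY_LR_CANDIDATES
    if policy_lr <= Q_FUNCTION_LR
]

for cfg in configs:
    print(cfg)
```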