Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Should I Run Offline Reinforcement Learning or Behavioral Cloning?
Authors: Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robotic manipulation, maze navigation, and Atari games, with a variety of data distributions. We observe that, under specific but common conditions such as sparse rewards or noisy data sources, modern offline RL methods can significantly outperform BC. |
| Researcher Affiliation | Collaboration | Aviral Kumar∗,1,2, Joey Hong∗,1, Anikait Singh1, Sergey Levine1,2; 1Department of EECS, UC Berkeley; 2Google Research (∗Equal Contribution) |
| Pseudocode | Yes | Appendix A, "Pseudocode for Algorithms": Algorithm 1, Conservative Offline RL Algorithm; Algorithm 2, Policy-Constraint Offline RL Algorithm. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We consider a diverse set of domains and behavior policies that are representative of practical scenarios: multi-stage robotic manipulation tasks from state (Adroit domains from Fu et al. [14]) and image observations [60], antmaze navigation [14], and 7 Atari games [3]. We use the scripted expert provided by Fu et al. [14] for antmaze and those provided by Singh et al. [60] for manipulation, an RL-trained expert for Atari, and human expert for Adroit [50]. |
| Dataset Splits | No | The paper mentions using a validation set in the context of hyperparameter tuning for BC ("early stopping based on validation losses") but does not provide specific split percentages or sample counts for train/validation/test splits for any dataset. |
| Hardware Specification | No | The paper does not specify the exact hardware components (e.g., specific GPU or CPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific algorithms like CQL, but does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We used default hyperparameters for the CQL algorithm (Q-function learning rate = 3e-4, policy learning rate = 1e-4), based on prior works that utilize these domains. ...with regards to the hyperparameter α in CQL... we used α = 0.1 for all Atari games... and α = 1.0 for the robotic manipulation domains... For the Antmaze and Adroit domains, we ran CQL training with multiple values of α {0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 20.0}. |
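
The experiment setup quoted in the table can be written down as a small configuration sketch, which may help when trying to reproduce the reported runs. Only the numeric values come from the quoted text. The layout and every name here (`CQL_DEFAULTS`, `ALPHA_BY_DOMAIN`, `run_configs`, the domain keys) are hypothetical and are not taken from the paper or from any released code.

```python
from typing import Dict, List

# Hypothetical layout: only the numeric values come from the quoted setup.
CQL_DEFAULTS: Dict[str, float] = {
    "q_lr": 3e-4,       # Q-function learning rate
    "policy_lr": 1e-4,  # policy learning rate
}

# One alpha where the setup reports a single value, a sweep where it lists several.
ALPHA_SWEEP: List[float] = [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 20.0]
ALPHA_BY_DOMAIN: Dict[str, List[float]] = {
    "atari": [0.1],
    "robotic_manipulation": [1.0],
    "antmaze": ALPHA_SWEEP,
    "adroit": ALPHA_SWEEP,
}


def run_configs(domain: str) -> List[dict]:
    """Return one config dict per alpha value for the given domain (hypothetical helper)."""
    return [
        {**CQL_DEFAULTS, "alpha": alpha, "domain": domain}
        for alpha in ALPHA_BY_DOMAIN[domain]
    ]
```

Calling `run_configs("antmaze")` yields seven configurations, one per α value in the sweep, while `run_configs("atari")` yields a single configuration with α = 0.1. The quoted setup does not say how the best α was selected for Antmaze and Adroit, so that step is left out of the sketch.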