Should I Run Offline Reinforcement Learning or Behavioral Cloning?
Authors: Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robotic manipulation, maze navigation, and Atari games, with a variety of data distributions. We observe that, under specific but common conditions such as sparse rewards or noisy data sources, modern offline RL methods can significantly outperform BC. |
| Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Joey Hong (1), Anikait Singh (1), Sergey Levine (1,2); 1: Department of EECS, UC Berkeley; 2: Google Research (equal contribution). |
| Pseudocode | Yes | Appendix A ("Pseudocode for Algorithms") gives Algorithm 1 (Conservative Offline RL Algorithm) and Algorithm 2 (Policy-Constraint Offline RL Algorithm); a hedged sketch of the conservative objective appears below the table. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We consider a diverse set of domains and behavior policies that are representative of practical scenarios: multi-stage robotic manipulation tasks from state (Adroit domains from Fu et al. [14]) and image observations [60], antmaze navigation [14], and 7 Atari games [3]. We use the scripted expert provided by Fu et al. [14] for antmaze and those provided by Singh et al. [60] for manipulation, an RL-trained expert for Atari, and human expert for Adroit [50]. |
| Dataset Splits | No | The paper mentions using a validation set in the context of hyperparameter tuning for BC ("early stopping based on validation losses") but does not provide specific split percentages or sample counts for train/validation/test splits for any dataset. (A hedged sketch of validation-based early stopping appears below the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware components (e.g., specific GPU or CPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific algorithms like CQL, but does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We used default hyperparameters for the CQL algorithm (Q-function learning rate = 3e-4, policy learning rate = 1e-4), based on prior works that utilize these domains. ...with regards to the hyperparameter α in CQL... we used α = 0.1 for all Atari games... and α = 1.0 for the robotic manipulation domains... For the Antmaze and Adroit domains, we ran CQL training with multiple values of α ∈ {0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 20.0}. (See the CQL sketch below the table.) |
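
The Pseudocode and Experiment Setup rows above both center on CQL, the conservative offline RL method behind Algorithm 1. Since the paper does not release code, the following is only a minimal sketch of the standard CQL objective for discrete actions (as in the Atari experiments): the network architecture, batch format, and helper names are illustrative assumptions; only the α values and learning rates quoted in the table come from the paper.

```python
# Minimal sketch of the conservative Q-learning (CQL) objective referenced by
# the Pseudocode and Experiment Setup rows. Network sizes, batch shapes, and
# helper names are assumptions, not the paper's (unreleased) implementation.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small fully connected Q-network: state -> one Q-value per action."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


def cql_loss(q_net, target_q_net, batch, alpha: float = 0.1, gamma: float = 0.99):
    """Bellman error plus the CQL conservatism penalty, weighted by alpha.

    The penalty pushes down Q-values over all actions (log-sum-exp term) while
    pushing up Q-values on the actions actually present in the offline dataset.
    """
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset
    q_all = q_net(s)                                      # (B, num_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values

    bellman_error = ((q_taken - target) ** 2).mean()
    conservatism = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return bellman_error + alpha * conservatism


# Learning rate quoted in the Experiment Setup row (3e-4 for the Q-function);
# alpha = 0.1 is the value the table reports for the Atari games.
# optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
```

Here α trades off conservatism against the Bellman fit, so the larger values swept for the Antmaze and Adroit domains correspond to stronger pessimism about out-of-distribution actions.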
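
The Dataset Splits row notes that BC hyperparameters were tuned with early stopping on validation loss, without reported split sizes. The sketch below shows one common way to implement that; the patience value, checkpointing scheme, and the `policy.log_prob(states, actions)` interface are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch of early stopping for behavioral cloning on a held-out
# validation set. The policy is assumed to expose log_prob(states, actions);
# that interface, the patience value, and the loader names are illustrative.
import copy

import torch


def train_bc_with_early_stopping(policy, train_loader, val_loader,
                                 epochs: int = 100, patience: int = 10,
                                 lr: float = 1e-4):
    """Train a BC policy by maximum likelihood, keep the best validation checkpoint."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    best_val, best_state, stale = float("inf"), None, 0

    for _ in range(epochs):
        policy.train()
        for states, actions in train_loader:
            loss = -policy.log_prob(states, actions).mean()  # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        policy.eval()
        with torch.no_grad():
            val_loss = sum(-policy.log_prob(s, a).mean().item()
                           for s, a in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val, best_state, stale = val_loss, copy.deepcopy(policy.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation loss stopped improving

    policy.load_state_dict(best_state)
    return policy
```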