Should I Run Offline Reinforcement Learning or Behavioral Cloning?

Authors: Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robotic manipulation, maze navigation, and Atari games, with a variety of data distributions. We observe that, under specific but common conditions such as sparse rewards or noisy data sources, modern offline RL methods can significantly outperform BC.
Researcher Affiliation | Collaboration | Aviral Kumar*,1,2, Joey Hong*,1, Anikait Singh1, Sergey Levine1,2; 1Department of EECS, UC Berkeley; 2Google Research (*Equal Contribution)
Pseudocode | Yes | Appendix A ("Pseudocode for Algorithms") gives Algorithm 1 (Conservative Offline RL Algorithm) and Algorithm 2 (Policy-Constraint Offline RL Algorithm); a hedged sketch of the conservative update appears after the table.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We consider a diverse set of domains and behavior policies that are representative of practical scenarios: multi-stage robotic manipulation tasks from state (Adroit domains from Fu et al. [14]) and image observations [60], antmaze navigation [14], and 7 Atari games [3]. We use the scripted expert provided by Fu et al. [14] for antmaze and those provided by Singh et al. [60] for manipulation, an RL-trained expert for Atari, and human expert for Adroit [50].
Dataset Splits | No | The paper mentions using a validation set in the context of hyperparameter tuning for BC ("early stopping based on validation losses") but does not provide specific split percentages or sample counts for train/validation/test splits for any dataset; a sketch of such validation-based early stopping appears after the table.
Hardware Specification | No | The paper does not specify the exact hardware components (e.g., specific GPU or CPU models) used for running the experiments.
Software Dependencies | No | The paper mentions specific algorithms like CQL, but does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We used default hyperparameters for the CQL algorithm (Q-function learning rate = 3e-4, policy learning rate = 1e-4), based on prior works that utilize these domains. ...with regards to the hyperparameter α in CQL... we used α = 0.1 for all Atari games... and α = 1.0 for the robotic manipulation domains... For the Antmaze and Adroit domains, we ran CQL training with multiple values of α ∈ {0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 20.0}.
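
To make the "Pseudocode" and "Experiment Setup" rows concrete, below is a minimal PyTorch-style sketch of the conservative Q-update that Algorithm 1 refers to (a CQL-style log-sum-exp regularizer weighted by α). It is a sketch under assumptions, not the authors' implementation: the network sizes, the single Q-function, the uniform action sampling, and the names `QNetwork` and `cql_update` are illustrative; only the α values and the Q-function learning rate of 3e-4 quoted in the table come from the paper.

```python
# Minimal sketch of a conservative Q-update (CQL-style), illustrating Algorithm 1.
# NOT the authors' code: network sizes, the single Q-function, and uniform action
# sampling are simplifying assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Q(s, a) for continuous states/actions (hidden size is an assumption)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def cql_update(q_net, q_target, policy, batch, q_optim,
               alpha=1.0, gamma=0.99, num_sampled_actions=10):
    """One gradient step: TD error plus an alpha-weighted conservative term that
    pushes Q down on sampled out-of-distribution actions and up on dataset actions."""
    s, a, r, s_next, done = batch  # tensors of shape [B, ...]

    # Standard Bellman target using the target network and the current policy.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_target(s_next, policy(s_next))
    td_loss = F.mse_loss(q_net(s, a), target)

    # Conservative regularizer: log-sum-exp of Q over actions sampled uniformly
    # in [-1, 1] (an assumption), minus Q on the dataset action.
    B, act_dim = a.shape
    rand_a = torch.rand(B, num_sampled_actions, act_dim) * 2.0 - 1.0
    s_rep = s.unsqueeze(1).expand(-1, num_sampled_actions, -1).reshape(-1, s.shape[-1])
    q_rand = q_net(s_rep, rand_a.reshape(-1, act_dim)).reshape(B, num_sampled_actions)
    conservative_loss = (torch.logsumexp(q_rand, dim=1) - q_net(s, a)).mean()

    loss = td_loss + alpha * conservative_loss
    q_optim.zero_grad()
    loss.backward()
    q_optim.step()
    return loss.item()

# Usage with the hyperparameters quoted above (state/action dims are placeholders):
# q_net = QNetwork(state_dim=17, action_dim=6)
# q_target = QNetwork(state_dim=17, action_dim=6)
# q_target.load_state_dict(q_net.state_dict())
# q_optim = torch.optim.Adam(q_net.parameters(), lr=3e-4)  # Q-function lr = 3e-4
# cql_update(q_net, q_target, policy, batch, q_optim, alpha=1.0)  # alpha = 1.0 (manipulation)
```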
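Similarly, for the "Dataset Splits" row, the quoted BC tuning protocol ("early stopping based on validation losses") might look like the sketch below. The 90/10 split, the patience of 5 epochs, the batch size, and the `fit_bc` helper are assumptions made for illustration; the paper does not report how the data was split.

```python
# Sketch of behavioral cloning with validation-based early stopping. The split
# fraction, patience, batch size, and learning rate are assumptions; the paper
# only states that early stopping used validation losses.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split

def fit_bc(policy, states, actions, epochs=100, patience=5, lr=1e-4):
    """Train a deterministic BC policy with MSE loss; stop once the held-out
    validation loss has not improved for `patience` epochs."""
    dataset = TensorDataset(states, actions)
    n_val = max(1, int(0.1 * len(dataset)))               # assumed 10% validation split
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=256)
    optim = torch.optim.Adam(policy.parameters(), lr=lr)

    best_val, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(epochs):
        policy.train()
        for s, a in train_loader:
            loss = F.mse_loss(policy(s), a)
            optim.zero_grad()
            loss.backward()
            optim.step()

        policy.eval()
        with torch.no_grad():
            val_loss = sum(F.mse_loss(policy(s), a, reduction="sum") for s, a in val_loader)
            val_loss = (val_loss / len(val_set)).item()

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = {k: v.clone() for k, v in policy.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                     # early stopping trigger
                break
    if best_state is not None:
        policy.load_state_dict(best_state)                 # restore best checkpoint
    return best_val
```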