Should I Run Offline Reinforcement Learning or Behavioral Cloning?
Authors: Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robotic manipulation, maze navigation, and Atari games, with a variety of data distributions. We observe that, under specific but common conditions such as sparse rewards or noisy data sources, modern offline RL methods can significantly outperform BC. |
| Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Joey Hong (1), Anikait Singh (1), Sergey Levine (1,2); 1: Department of EECS, UC Berkeley; 2: Google Research (equal contribution). |
| Pseudocode | Yes | Appendix A ("Pseudocode for Algorithms") gives Algorithm 1 (Conservative Offline RL Algorithm) and Algorithm 2 (Policy-Constraint Offline RL Algorithm); a hedged sketch of the conservative objective appears below the table. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We consider a diverse set of domains and behavior policies that are representative of practical scenarios: multi-stage robotic manipulation tasks from state (Adroit domains from Fu et al. [14]) and image observations [60], antmaze navigation [14], and 7 Atari games [3]. We use the scripted expert provided by Fu et al. [14] for antmaze and those provided by Singh et al. [60] for manipulation, an RL-trained expert for Atari, and human expert for Adroit [50]. |
| Dataset Splits | No | The paper mentions using a validation set in the context of hyperparameter tuning for BC ("early stopping based on validation losses") but does not provide specific split percentages or sample counts for train/validation/test splits for any dataset. (A hedged sketch of validation-based early stopping appears below the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware components (e.g., specific GPU or CPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific algorithms like CQL, but does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We used default hyperparameters for the CQL algorithm (Q-function learning rate = 3e-4, policy learning rate = 1e-4), based on prior works that utilize these domains. ...with regards to the hyperparameter α in CQL... we used α = 0.1 for all Atari games... and α = 1.0 for the robotic manipulation domains... For the Antmaze and Adroit domains, we ran CQL training with multiple values of α ∈ {0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 20.0}. (See the CQL sketch below the table.) |
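
The Pseudocode and Experiment Setup rows above both center on CQL, the conservative offline RL method behind Algorithm 1. Since the paper does not release code, the following is only a minimal sketch of the standard CQL objective for discrete actions (as in the Atari experiments): the network architecture, batch format, and helper names are illustrative assumptions; only the α values and learning rates quoted in the table come from the paper.

```python
# Minimal sketch of the conservative Q-learning (CQL) objective referenced by
# the Pseudocode and Experiment Setup rows. Network sizes, batch shapes, and
# helper names are assumptions, not the paper's (unreleased) implementation.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small fully connected Q-network: state -> one Q-value per action."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


def cql_loss(q_net, target_q_net, batch, alpha: float = 0.1, gamma: float = 0.99):
    """Bellman error plus the CQL conservatism penalty, weighted by alpha.

    The penalty pushes down Q-values over all actions (log-sum-exp term) while
    pushing up Q-values on the actions actually present in the offline dataset.
    """
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset
    q_all = q_net(s)                                      # (B, num_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values

    bellman_error = ((q_taken - target) ** 2).mean()
    conservatism = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return bellman_error + alpha * conservatism


# Learning rate quoted in the Experiment Setup row (3e-4 for the Q-function);
# alpha = 0.1 is the value the table reports for the Atari games.
# optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
```

Here α trades off conservatism against the Bellman fit, so the larger values swept for the Antmaze and Adroit domains correspond to stronger pessimism about out-of-distribution actions.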
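
The Dataset Splits row notes that BC hyperparameters were tuned with early stopping on validation loss, without reported split sizes. The sketch below shows one common way to implement that; the patience value, checkpointing scheme, and the `policy.log_prob(states, actions)` interface are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch of early stopping for behavioral cloning on a held-out
# validation set. The policy is assumed to expose log_prob(states, actions);
# that interface, the patience value, and the loader names are illustrative.
import copy

import torch


def train_bc_with_early_stopping(policy, train_loader, val_loader,
                                 epochs: int = 100, patience: int = 10,
                                 lr: float = 1e-4):
    """Train a BC policy by maximum likelihood, keep the best validation checkpoint."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    best_val, best_state, stale = float("inf"), None, 0

    for _ in range(epochs):
        policy.train()
        for states, actions in train_loader:
            loss = -policy.log_prob(states, actions).mean()  # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        policy.eval()
        with torch.no_grad():
            val_loss = sum(-policy.log_prob(s, a).mean().item()
                           for s, a in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val, best_state, stale = val_loss, copy.deepcopy(policy.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation loss stopped improving

    policy.load_state_dict(best_state)
    return policy
```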