RvS: What is Essential for Offline RL via Supervised Learning?

Authors: Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, Sergey Levine

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. (A sketch of such a conditional MLP policy appears below the table.)
Researcher Affiliation | Academia | 1 UC Berkeley, 2 Carnegie Mellon University
Pseudocode | Yes | Algorithm 1 RvS-Learning (paraphrased in a sketch below the table)
Open Source Code | Yes | 1. RvS: https://github.com/scottemmons/rvs
Open Datasets | Yes | GCSL is a suite of goal-conditioned environments used by Ghosh et al. [18] to evaluate GCSL, a goal-conditioned RvS method with online data collection. We adapt these tasks for offline RL by using a random policy to collect training data, which results in suboptimal trajectories. The tasks include 2D navigation with obstacles (Four Rooms, Eysenbach et al. [9]); two end-effector controlled Sawyer robotic arm tasks (Door and Pusher, Nair et al. [31]); the Lunar Lander video game, which requires controlling thrusters to land a simulated Lunar Excursion Module (Lander); and a manipulation task that requires rotating a valve with a robotic claw (Claw, Ahn et al. [2]). Gym Locomotion v2 tasks consist of the HalfCheetah, Hopper, and Walker datasets from the D4RL offline RL benchmark [12]. Franka Kitchen v0 is a 9-DoF robotic manipulation task paired with datasets of human demonstrations. This task originates from Gupta et al. [19] and was formalized as an offline RL task in D4RL [12]. AntMaze v2 involves controlling an 8-DoF quadruped to navigate to a particular goal state. This benchmark task also comes from D4RL [12]. (A hedged example of loading these D4RL datasets appears below the table.)
Dataset Splits | Yes | To test this, we train models on 80% of each Franka Kitchen dataset with two different hyperparameter settings: a regularized setting with dropout p = 0.1 and a small batch size of 256; and an unregularized setting without dropout and a large batch size of 16,384. We show results in Figure 4 (right). For all three datasets, validation set error does correlate with performance, but the strength of this correlation varies significantly. (A sketch of a trajectory-level 80/20 split appears below the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments. There are no mentions of specific GPU models, CPU models, or cloud computing instance types (e.g., NVIDIA A100, Intel Xeon, AWS p3.8xlarge).
Software Dependencies | No | The paper mentions 'JAXRL: Implementations of Reinforcement Learning algorithms in JAX' and provides a link to its repository, but it does not specify version numbers for JAX or any other software libraries, which is required for reproducibility.
Experiment Setup | Yes | Table 2: Hyperparameters. Architecture and design parameters that we found to work best in each domain. We define an epoch length to include all start-goal pairs in GCSL, i.e., to be |D|·(H choose 2). In D4RL, we set all epoch lengths at 2000·(50 choose 2) = 2,450,000. The table, reconstructed (a config-style transcription follows below):

Hyperparameter | Value | Environment
Hidden layers | 2 | All
Layer width | 1024 | All
Nonlinearity | ReLU | All
Learning rate | 1e-3 | All
Epochs | 10 | GCSL
Epochs | 50 | Kitchen
Epochs | 2000 | Gym
Gradient steps | 20000 | AntMaze
Batch size | 256 | GCSL, Kitchen
Batch size | 16384 | Gym, AntMaze
Dropout | 0.1 | GCSL, Kitchen
Dropout | 0 | Gym, AntMaze
Goal state | Given | GCSL
Goal state | All subtasks completed | Kitchen
Goal state | (x, y) location | AntMaze
Reward target | 110 | Gym medium-expert, AntMaze, Kitchen
Reward target | 90 | Gym {hopper, walker2d}-medium-replay, walker2d-medium
Reward target | 60 | Gym hopper-medium
Reward target | 40 | Gym random, halfcheetah-{medium, medium-replay}
Policy output | Discrete categorical | GCSL
Policy output | Unimodal Gaussian | Kitchen, Gym, AntMaze
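
For the "Research Type" row: the method the abstract summarizes is a conditional policy, a two-layer feedforward MLP trained by maximizing the likelihood of dataset actions given the state and an outcome (a goal or a reward target). Below is a minimal sketch of that idea; the class and function names are illustrative, not taken from the authors' repository, and a Gaussian output head is assumed (Table 2 uses a discrete categorical head for GCSL).

```python
# Illustrative sketch (not the authors' code): a conditional policy
# pi(a | s, omega) as a two-layer MLP, trained by maximizing the
# log-likelihood of dataset actions.
import torch
import torch.nn as nn


class RvSPolicy(nn.Module):  # hypothetical name
    def __init__(self, state_dim, cond_dim, action_dim, width=1024):
        super().__init__()
        # Two hidden layers of width 1024 with ReLU, as in Table 2.
        self.net = nn.Sequential(
            nn.Linear(state_dim + cond_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.mean = nn.Linear(width, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, cond):
        h = self.net(torch.cat([state, cond], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())


def nll_loss(policy, state, cond, action):
    # Maximizing likelihood is equivalent to minimizing the negative
    # log-likelihood of the dataset actions under the policy.
    return -policy(state, cond).log_prob(action).sum(-1).mean()
```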
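For the "Pseudocode" row: Algorithm 1 (RvS-Learning) trains the policy on hindsight-relabeled (state, outcome, action) tuples drawn from the offline trajectories. The sketch below paraphrases that loop under the assumption that the outcome is either a state visited later in the same trajectory (goal conditioning) or the average reward over the remaining steps (reward conditioning); the data layout and helper names are assumptions.

```python
# Hedged paraphrase of the RvS-Learning sampling step, not the released code.
import numpy as np


def sample_relabeled_batch(trajectories, batch_size, reward_conditioned=False,
                           rng=np.random):
    """Draw (state, outcome, action) tuples from offline trajectories.

    Each trajectory is assumed to be a dict with 'observations', 'actions',
    and 'rewards' arrays of equal length.
    """
    states, conds, actions = [], [], []
    for _ in range(batch_size):
        traj = trajectories[rng.randint(len(trajectories))]
        T = len(traj["actions"])
        t = rng.randint(T)
        if reward_conditioned:
            # Condition on the average reward over the remaining steps.
            cond = np.array([traj["rewards"][t:].mean()])
        else:
            # Condition on a state observed later in the same trajectory.
            h = rng.randint(t, T)
            cond = traj["observations"][h]
        states.append(traj["observations"][t])
        conds.append(cond)
        actions.append(traj["actions"][t])
    return np.stack(states), np.stack(conds), np.stack(actions)
```

Each sampled batch would then feed a standard supervised gradient step on the negative log-likelihood loss sketched above.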
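For the "Open Datasets" row: the D4RL suites referenced there (Gym locomotion, Franka Kitchen, AntMaze) are typically downloaded on first use through the d4rl package. The snippet below is a hedged example of that workflow; the specific dataset names are assumptions chosen to match the suites the paper cites, not a list from the paper.

```python
# Hedged example of fetching D4RL datasets; requires `gym` and `d4rl`.
import gym
import d4rl  # noqa: F401  (importing d4rl registers the offline-RL envs)

# Illustrative dataset names, one per suite mentioned in the paper.
for name in ["halfcheetah-medium-v2", "kitchen-mixed-v0", "antmaze-umaze-v2"]:
    env = gym.make(name)
    data = env.get_dataset()  # dict of numpy arrays
    print(name, data["observations"].shape, data["actions"].shape)
```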
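For the "Dataset Splits" row: a minimal sketch of holding out 20% of a dataset at the trajectory level, so that validation log-likelihood can be compared against policy performance as in the paper's Figure 4 analysis. The function name and data layout are assumptions, not the paper's.

```python
# Hypothetical trajectory-level 80/20 split, written as a reading aid.
import numpy as np


def split_trajectories(trajectories, train_frac=0.8, seed=0):
    """Hold out whole trajectories (not individual transitions) for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    train = [trajectories[i] for i in idx[:n_train]]
    val = [trajectories[i] for i in idx[n_train:]]
    return train, val
```

The paper's comparison then trains on the 80% split under two settings (dropout 0.1 with batch size 256 versus no dropout with batch size 16,384) and checks how well validation error tracks returns.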
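For the "Experiment Setup" row: the Table 2 hyperparameters, transcribed as a per-domain Python configuration purely as a reading aid; this is not a config file from the released code, and the key names are mine.

```python
# Transcription of Table 2 (shared and per-domain hyperparameters).
SHARED = dict(hidden_layers=2, layer_width=1024, nonlinearity="relu", lr=1e-3)

DOMAINS = {
    "gcsl":    dict(epochs=10, batch_size=256, dropout=0.1,
                    goal="given",
                    policy_output="discrete categorical"),
    "kitchen": dict(epochs=50, batch_size=256, dropout=0.1,
                    goal="all subtasks completed", reward_target=110,
                    policy_output="unimodal Gaussian"),
    "gym":     dict(epochs=2000, batch_size=16384, dropout=0.0,
                    reward_target="40-110 depending on dataset",
                    policy_output="unimodal Gaussian"),
    "antmaze": dict(gradient_steps=20000, batch_size=16384, dropout=0.0,
                    goal="(x, y) location", reward_target=110,
                    policy_output="unimodal Gaussian"),
}
```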