RvS: What is Essential for Offline RL via Supervised Learning?

Authors: Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, Sergey Levine

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. (A sketch of such a conditional MLP policy appears below the table.)
Researcher Affiliation | Academia | 1 UC Berkeley, 2 Carnegie Mellon University
Pseudocode | Yes | Algorithm 1 RvS-Learning (paraphrased in a sketch below the table)
Open Source Code | Yes | 1. RvS: https://github.com/scottemmons/rvs
Open Datasets | Yes | GCSL is a suite of goal-conditioned environments used by Ghosh et al. [18] to evaluate GCSL, a goal-conditioned RvS method with online data collection. We adapt these tasks for offline RL by using a random policy to collect training data, which results in suboptimal trajectories. The tasks include 2D navigation with obstacles (Four Rooms, Eysenbach et al. [9]); two end-effector controlled Sawyer robotic arm tasks (Door and Pusher, Nair et al. [31]); the Lunar Lander video game, which requires controlling thrusters to land a simulated Lunar Excursion Module (Lander); and a manipulation task that requires rotating a valve with a robotic claw (Claw, Ahn et al. [2]). Gym Locomotion v2 tasks consist of the HalfCheetah, Hopper, and Walker datasets from the D4RL offline RL benchmark [12]. Franka Kitchen v0 is a 9-DoF robotic manipulation task paired with datasets of human demonstrations. This task originates from Gupta et al. [19] and was formalized as an offline RL task in D4RL [12]. AntMaze v2 involves controlling an 8-DoF quadruped to navigate to a particular goal state. This benchmark task also comes from D4RL [12]. (A hedged example of loading these D4RL datasets appears below the table.)
Dataset Splits | Yes | To test this, we train models on 80% of each Franka Kitchen dataset with two different hyperparameter settings: a regularized setting with dropout p = 0.1 and a small batch size of 256; and an unregularized setting without dropout and a large batch size of 16,384. We show results in Figure 4 (right). For all three datasets, validation set error does correlate with performance, but the strength of this correlation varies significantly. (A sketch of a trajectory-level 80/20 split appears below the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments. There are no mentions of specific GPU models, CPU models, or cloud computing instance types (e.g., NVIDIA A100, Intel Xeon, AWS p3.8xlarge).
Software Dependencies | No | The paper mentions 'JAXRL: Implementations of Reinforcement Learning algorithms in JAX' and provides a link to its repository, but it does not specify version numbers for JAX or any other software libraries, which is required for reproducibility.
Experiment Setup | Yes | Table 2: Hyperparameters. Architecture and design parameters that we found to work best in each domain. We define an epoch length to include all start-goal pairs in GCSL, i.e., to be |D|·(H choose 2). In D4RL, we set all epoch lengths at 2000·(50 choose 2) = 2,450,000. The table, reconstructed (a config-style transcription follows below):

Hyperparameter | Value | Environment
Hidden layers | 2 | All
Layer width | 1024 | All
Nonlinearity | ReLU | All
Learning rate | 1e-3 | All
Epochs | 10 | GCSL
Epochs | 50 | Kitchen
Epochs | 2000 | Gym
Gradient steps | 20000 | AntMaze
Batch size | 256 | GCSL, Kitchen
Batch size | 16384 | Gym, AntMaze
Dropout | 0.1 | GCSL, Kitchen
Dropout | 0 | Gym, AntMaze
Goal state | Given | GCSL
Goal state | All subtasks completed | Kitchen
Goal state | (x, y) location | AntMaze
Reward target | 110 | Gym medium-expert, AntMaze, Kitchen
Reward target | 90 | Gym {hopper, walker2d}-medium-replay, walker2d-medium
Reward target | 60 | Gym hopper-medium
Reward target | 40 | Gym random, halfcheetah-{medium, medium-replay}
Policy output | Discrete categorical | GCSL
Policy output | Unimodal Gaussian | Kitchen, Gym, AntMaze
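
For the "Research Type" row: the method the abstract summarizes is a conditional policy, a two-layer feedforward MLP trained by maximizing the likelihood of dataset actions given the state and an outcome (a goal or a reward target). Below is a minimal sketch of that idea; the class and function names are illustrative, not taken from the authors' repository, and a Gaussian output head is assumed (Table 2 uses a discrete categorical head for GCSL).

```python
# Illustrative sketch (not the authors' code): a conditional policy
# pi(a | s, omega) as a two-layer MLP, trained by maximizing the
# log-likelihood of dataset actions.
import torch
import torch.nn as nn


class RvSPolicy(nn.Module):  # hypothetical name
    def __init__(self, state_dim, cond_dim, action_dim, width=1024):
        super().__init__()
        # Two hidden layers of width 1024 with ReLU, as in Table 2.
        self.net = nn.Sequential(
            nn.Linear(state_dim + cond_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.mean = nn.Linear(width, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, cond):
        h = self.net(torch.cat([state, cond], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())


def nll_loss(policy, state, cond, action):
    # Maximizing likelihood is equivalent to minimizing the negative
    # log-likelihood of the dataset actions under the policy.
    return -policy(state, cond).log_prob(action).sum(-1).mean()
```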
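For the "Pseudocode" row: Algorithm 1 (RvS-Learning) trains the policy on hindsight-relabeled (state, outcome, action) tuples drawn from the offline trajectories. The sketch below paraphrases that loop under the assumption that the outcome is either a state visited later in the same trajectory (goal conditioning) or the average reward over the remaining steps (reward conditioning); the data layout and helper names are assumptions.

```python
# Hedged paraphrase of the RvS-Learning sampling step, not the released code.
import numpy as np


def sample_relabeled_batch(trajectories, batch_size, reward_conditioned=False,
                           rng=np.random):
    """Draw (state, outcome, action) tuples from offline trajectories.

    Each trajectory is assumed to be a dict with 'observations', 'actions',
    and 'rewards' arrays of equal length.
    """
    states, conds, actions = [], [], []
    for _ in range(batch_size):
        traj = trajectories[rng.randint(len(trajectories))]
        T = len(traj["actions"])
        t = rng.randint(T)
        if reward_conditioned:
            # Condition on the average reward over the remaining steps.
            cond = np.array([traj["rewards"][t:].mean()])
        else:
            # Condition on a state observed later in the same trajectory.
            h = rng.randint(t, T)
            cond = traj["observations"][h]
        states.append(traj["observations"][t])
        conds.append(cond)
        actions.append(traj["actions"][t])
    return np.stack(states), np.stack(conds), np.stack(actions)
```

Each sampled batch would then feed a standard supervised gradient step on the negative log-likelihood loss sketched above.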
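For the "Open Datasets" row: the D4RL suites referenced there (Gym locomotion, Franka Kitchen, AntMaze) are typically downloaded on first use through the d4rl package. The snippet below is a hedged example of that workflow; the specific dataset names are assumptions chosen to match the suites the paper cites, not a list from the paper.

```python
# Hedged example of fetching D4RL datasets; requires `gym` and `d4rl`.
import gym
import d4rl  # noqa: F401  (importing d4rl registers the offline-RL envs)

# Illustrative dataset names, one per suite mentioned in the paper.
for name in ["halfcheetah-medium-v2", "kitchen-mixed-v0", "antmaze-umaze-v2"]:
    env = gym.make(name)
    data = env.get_dataset()  # dict of numpy arrays
    print(name, data["observations"].shape, data["actions"].shape)
```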
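For the "Dataset Splits" row: a minimal sketch of holding out 20% of a dataset at the trajectory level, so that validation log-likelihood can be compared against policy performance as in the paper's Figure 4 analysis. The function name and data layout are assumptions, not the paper's.

```python
# Hypothetical trajectory-level 80/20 split, written as a reading aid.
import numpy as np


def split_trajectories(trajectories, train_frac=0.8, seed=0):
    """Hold out whole trajectories (not individual transitions) for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    train = [trajectories[i] for i in idx[:n_train]]
    val = [trajectories[i] for i in idx[n_train:]]
    return train, val
```

The paper's comparison then trains on the 80% split under two settings (dropout 0.1 with batch size 256 versus no dropout with batch size 16,384) and checks how well validation error tracks returns.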
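For the "Experiment Setup" row: the Table 2 hyperparameters, transcribed as a per-domain Python configuration purely as a reading aid; this is not a config file from the released code, and the key names are mine.

```python
# Transcription of Table 2 (shared and per-domain hyperparameters).
SHARED = dict(hidden_layers=2, layer_width=1024, nonlinearity="relu", lr=1e-3)

DOMAINS = {
    "gcsl":    dict(epochs=10, batch_size=256, dropout=0.1,
                    goal="given",
                    policy_output="discrete categorical"),
    "kitchen": dict(epochs=50, batch_size=256, dropout=0.1,
                    goal="all subtasks completed", reward_target=110,
                    policy_output="unimodal Gaussian"),
    "gym":     dict(epochs=2000, batch_size=16384, dropout=0.0,
                    reward_target="40-110 depending on dataset",
                    policy_output="unimodal Gaussian"),
    "antmaze": dict(gradient_steps=20000, batch_size=16384, dropout=0.0,
                    goal="(x, y) location", reward_target=110,
                    policy_output="unimodal Gaussian"),
}
```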