Efficient Online Reinforcement Learning with Offline Data

Authors: Philip J. Ball, Laura Smith, Ilya Kostrikov, Sergey Levine

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We see that correct application of these simple recommendations can provide a 2.5× improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.
Researcher Affiliation | Academia | ¹University of Oxford, ²UC Berkeley. Correspondence to: Philip J. Ball <ball@robots.ox.ac.uk>, Laura Smith <smithlaura@berkeley.edu>, Ilya Kostrikov <kostrikov@berkeley.edu>.
Pseudocode | Yes | Algorithm 1: Online RL with Offline Data (RLPD). (A sketch of this training loop appears below the table.)
Open Source Code | Yes | We have released our code here: github.com/ikostrikov/rlpd.
Open Datasets | Yes | Sparse Adroit (Nair et al., 2020); D4RL Ant Maze (Fu et al., 2020); D4RL Locomotion (Fu et al., 2020); V-D4RL (Lu et al., 2022). (A hypothetical loading example appears below the table.)
Dataset Splits | No | No specific details about train/validation/test dataset splits, percentages, or explicit sample counts were found.
Hardware Specification | No | The paper mentions using 'the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley' but does not specify any particular GPU models, CPU models, or detailed hardware configurations used for the experiments.
Software Dependencies | No | The paper states the codebase is 'written in JAX (Bradbury et al., 2018)' but does not provide specific version numbers for JAX or any other software dependencies, such as Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | Table 1 (RLPD hyperparameters): Online batch size 128; Offline batch size 128; Discount (γ) 0.99; Optimizer Adam; Learning rate 3 × 10⁻⁴; Ensemble size (E) 10; Critic EMA weight (ρ) 0.005; Gradient steps (state-based; G or UTD) 20; Network width 256 units; Initial entropy temperature (α) 1.0; Target entropy dim(A)/2. (These values are gathered into a config sketch below the table.)
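
For convenience, the Table 1 values above can be gathered into a single configuration dictionary. This is a minimal sketch only; the field names are illustrative and are not taken from the released codebase.

```python
# Table 1 hyperparameters collected into a plain config dict (illustrative names).
rlpd_config = {
    "online_batch_size": 128,
    "offline_batch_size": 128,
    "discount": 0.99,                 # gamma
    "optimizer": "adam",
    "learning_rate": 3e-4,
    "ensemble_size": 10,              # E critics
    "critic_ema_weight": 0.005,       # rho, target-network EMA
    "utd_ratio": 20,                  # gradient steps per env step (state-based)
    "network_width": 256,             # hidden units per layer
    "init_entropy_temperature": 1.0,  # alpha
    "target_entropy": "dim(A)/2",     # as listed in the extracted table
}
```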
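The datasets listed in the Open Datasets row are distributed through the `d4rl` package. The snippet below is a hypothetical loading example; the environment id `antmaze-umaze-v2` is an illustrative choice and not prescribed by the paper or this report.

```python
# Hypothetical example of fetching one D4RL offline dataset as transition arrays.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("antmaze-umaze-v2")          # illustrative environment id
dataset = d4rl.qlearning_dataset(env)       # dict of observations, actions, rewards, ...

print(dataset["observations"].shape, dataset["actions"].shape, dataset["rewards"].shape)
```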
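Finally, the Pseudocode row refers to Algorithm 1, whose central design choice is symmetric sampling: each update batch mixes equal halves of online and offline data, applied at a high update-to-data (UTD) ratio. The sketch below illustrates only that loop structure; the agent, environment, and buffers are stand-in placeholders, not the authors' JAX implementation.

```python
# Minimal sketch of the symmetric-sampling loop (Algorithm 1 structure only).
import numpy as np

rng = np.random.default_rng(0)

def sample(buffer, n):
    """Uniformly sample n transitions from a list-like replay buffer."""
    idx = rng.integers(0, len(buffer), size=n)
    return [buffer[i] for i in idx]

def agent_update(agent_state, batch):
    """Placeholder for one SAC-style gradient step on the actor/critic ensemble."""
    agent_state["updates"] += 1
    return agent_state

def agent_act(agent_state, obs):
    """Placeholder policy: random action in [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=1)

# Batch sizes and UTD ratio from Table 1.
ONLINE_BATCH, OFFLINE_BATCH, UTD = 128, 128, 20

# Pre-collected offline data (dummy transitions stand in for a D4RL dataset).
offline_buffer = [{"obs": np.zeros(3), "act": np.zeros(1), "rew": 0.0,
                   "next_obs": np.zeros(3), "done": False} for _ in range(1000)]
online_buffer = []
agent_state = {"updates": 0}

obs = np.zeros(3)  # stand-in for env.reset()
for step in range(100):
    act = agent_act(agent_state, obs)
    next_obs, rew, done = np.zeros(3), 0.0, False  # stand-in for env.step(act)
    online_buffer.append({"obs": obs, "act": act, "rew": rew,
                          "next_obs": next_obs, "done": done})
    obs = next_obs
    # Symmetric sampling: every update batch is 50% online and 50% offline data,
    # repeated UTD times per environment step.
    for _ in range(UTD):
        batch = sample(online_buffer, ONLINE_BATCH) + sample(offline_buffer, OFFLINE_BATCH)
        agent_state = agent_update(agent_state, batch)
```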