Efficient Online Reinforcement Learning with Offline Data
Authors: Philip J. Ball, Laura Smith, Ilya Kostrikov, Sergey Levine
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We see that correct application of these simple recommendations can provide a 2.5× improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead. |
| Researcher Affiliation | Academia | 1University of Oxford 2UC Berkeley. Correspondence to: Philip J. Ball <ball@robots.ox.ac.uk>, Laura Smith <smithlaura@berkeley.edu>, Ilya Kostrikov <kostrikov@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1: Online RL with Offline Data (RLPD); see the symmetric-sampling sketch after the table. |
| Open Source Code | Yes | We have released our code here: github.com/ikostrikov/rlpd. |
| Open Datasets | Yes | Sparse Adroit (Nair et al., 2020); D4RL Ant Maze (Fu et al., 2020); D4RL Locomotion (Fu et al., 2020); V-D4RL (Lu et al., 2022). |
| Dataset Splits | No | No specific details about train/validation/test dataset splits, percentages, or explicit sample counts were found. |
| Hardware Specification | No | The paper mentions using 'the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley' but does not specify any particular GPU models, CPU models, or detailed hardware configurations used for the experiments. |
| Software Dependencies | No | The paper states the codebase is 'written in JAX (Bradbury et al., 2018)' but does not provide specific version numbers for JAX or any other software dependencies, such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | Table 1 (RLPD hyperparameters): online batch size 128; offline batch size 128; discount (γ) 0.99; optimizer Adam; learning rate 3×10⁻⁴; ensemble size (E) 10; critic EMA weight (ρ) 0.005; gradient steps, state-based (G or UTD) 20; network width 256 units; initial entropy temperature (α) 1.0; target entropy −dim(A)/2. See the config sketch below the table. |
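
The pseudocode row points at the paper's Algorithm 1, whose core idea is symmetric sampling: each gradient step draws half of its batch from the online replay buffer and half from the offline dataset. Below is a minimal sketch of that batching step; the `.sample(n)` interface and the `rlpd_batch` name are illustrative assumptions, not the released code.

```python
import numpy as np

def rlpd_batch(online_buffer, offline_data, online_bs=128, offline_bs=128):
    """Form one training batch by symmetric sampling (Algorithm 1 of RLPD).

    `online_buffer` and `offline_data` are assumed to expose a `.sample(n)`
    method returning dicts of arrays keyed by 'obs', 'action', 'reward',
    'next_obs', 'done' -- a hypothetical interface, not the paper's code.
    """
    on = online_buffer.sample(online_bs)    # 128 online transitions (Table 1)
    off = offline_data.sample(offline_bs)   # 128 offline transitions (Table 1)
    # Concatenate into a single 256-transition batch for the SAC-style update.
    return {k: np.concatenate([on[k], off[k]], axis=0) for k in on}
```

With the Table 1 update-to-data ratio of G = 20, a training loop would form 20 such batches (and take 20 gradient steps) per environment step collected.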
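
Table 1's settings also transcribe naturally into a flat config. The sketch below simply restates the reported values; the key names are illustrative, and the target entropy is left as a formula because it depends on the action dimension.

```python
# RLPD hyperparameters as reported in Table 1 (key names are illustrative).
RLPD_CONFIG = {
    "online_batch_size": 128,
    "offline_batch_size": 128,
    "discount": 0.99,                 # gamma
    "optimizer": "adam",
    "learning_rate": 3e-4,
    "ensemble_size": 10,              # E, number of critics
    "critic_ema_weight": 0.005,       # rho, target-network EMA coefficient
    "utd_ratio": 20,                  # G, gradient steps per env step (state-based)
    "network_width": 256,             # hidden units per layer
    "init_entropy_temperature": 1.0,  # alpha
}

def target_entropy(action_dim: int) -> float:
    # Table 1: target entropy = -dim(A)/2, computed from the action space.
    return -action_dim / 2
```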