Jump-Start Reinforcement Learning
Authors: Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show via experiments that it is able to significantly outperform existing imitation and reinforcement learning algorithms, particularly in the small-data regime. In addition, we provide an upper bound on the sample complexity of JSRL and show that with the help of a guide-policy, one can improve the sample complexity for non-optimism exploration methods from exponential in horizon to polynomial. Finally, we demonstrate that JSRL significantly outperforms previously proposed imitation and reinforcement learning approaches on a set of benchmark tasks as well as more challenging vision-based robotic problems. |
| Researcher Affiliation | Collaboration | (1) Google, Mountain View, California; (2) University of California, Berkeley, Berkeley, California; (3) Everyday Robots, Mountain View, California, United States; (4) Department of Statistics, University of Chicago; (5) Stanford University, Stanford, California. |
| Pseudocode | Yes | We provide a detailed description of JSRL in Algorithm 1. ... Algorithm 2 Jump-Start Reinforcement Learning for Episodic MDP with CB oracle [a hedged sketch of this curriculum appears below the table] |
| Open Source Code | Yes | A project webpage is available at https://jumpstartrl.github.io |
| Open Datasets | Yes | To study how JSRL compares with competitive IL+RL methods, we utilize the D4RL (Fu et al., 2020) benchmark tasks, which vary in task complexity and offline dataset quality. [a minimal D4RL loading sketch appears below the table] |
| Dataset Splits | No | The paper describes data usage for training (e.g., 'offline buffer' and 'online buffer' with sampling percentages) and mentions policy evaluation, but does not explicitly state the use of a distinct 'validation set' or 'validation split' for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper mentions simulating a '7 DoF arm' as part of the experimental setup but does not specify any computing hardware such as CPU, GPU models, or memory used for training or running simulations. |
| Software Dependencies | No | The paper mentions using open-sourced implementations of IQL, AWAC, BC, and CQL, and building upon the QT-Opt algorithm, but it does not specify any version numbers for these software components or any other key libraries or programming languages used. |
| Experiment Setup | Yes | We consider a common setting where the agent first trains on an offline dataset (1m transitions for Ant Maze, 100k transitions for Adroit) and then runs online fine-tuning for 1m steps. For fine-tuning, we maintain two replay buffers for offline and online transitions. The offline buffer contains all the demonstrations, and the online buffer is FIFO with a fixed capacity of 100k transitions. For each gradient update during fine-tuning, we sample minibatches such that 75% of samples come from the online buffer, and 25% of samples come from the offline buffer. JSRL introduces three hyperparameters: (1) the initial number of guide-steps that the guide-policy takes at the beginning of fine-tuning (H1), (2) the number of curriculum stages (n), and (3) the performance threshold that decides whether to move on to the next curriculum stage (β). |
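
Regarding the Pseudocode row above: the quoted Algorithm 1 describes JSRL's switching curriculum, in which a pre-existing guide-policy acts for the first portion of each episode before handing control to the learned exploration-policy, and the hand-off point is annealed toward zero as performance improves. The sketch below is a minimal, hypothetical Python rendering of that idea, not the authors' released code: it assumes a classic Gym-style `reset`/`step` environment, policies that are plain callables from observations to actions, an `agent` object with `policy` and `update` attributes, and a simplified stage-advancement rule (advance when the recent mean return is within `beta` of the best mean return seen so far). In the paper's notation, `h1` plays the role of H1, `n_stages` of n, and `beta` of β.

```python
import numpy as np


def jsrl_rollout(env, guide_policy, explore_policy, guide_steps, max_steps=1000):
    """Collect one episode: the guide-policy acts for the first `guide_steps`
    steps, then the exploration-policy takes over (the JSRL jump-start)."""
    obs = env.reset()                          # classic Gym API assumed
    transitions, episode_return = [], 0.0
    for t in range(max_steps):
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        episode_return += reward
        obs = next_obs
        if done:
            break
    return transitions, episode_return


def jsrl_curriculum(env, guide_policy, agent, h1, n_stages, beta,
                    episodes_per_stage=50, max_rounds=200):
    """Anneal the number of guide-steps from `h1` toward 0 over `n_stages`
    curriculum stages.  `agent` is a hypothetical object exposing a `policy`
    callable and an `update(transitions)` method for any off-policy RL step."""
    schedule = np.linspace(h1, 0, n_stages).astype(int)   # e.g. [200, 100, 0]
    stage, best_return = 0, -np.inf
    for _ in range(max_rounds):
        guide_steps = int(schedule[stage])
        returns = []
        for _ in range(episodes_per_stage):
            transitions, ep_return = jsrl_rollout(
                env, guide_policy, agent.policy, guide_steps)
            agent.update(transitions)          # off-policy update on new data
            returns.append(ep_return)
        mean_return = float(np.mean(returns))
        best_return = max(best_return, mean_return)
        # Simplified advancement rule: move to the next curriculum stage only
        # if performance has not dropped more than `beta` below the best
        # mean return observed so far.
        if mean_return >= best_return - beta and stage < n_stages - 1:
            stage += 1
    return agent
```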
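
Regarding the Open Datasets row above: the D4RL benchmark exposes its offline datasets behind Gym environment wrappers, so reproducing the data side of the experiments mostly amounts to installing the `d4rl` package. The snippet below is a minimal loading sketch; the specific task id is chosen for illustration and is not necessarily the exact dataset version used in the paper.

```python
import gym
import d4rl  # importing d4rl registers the benchmark environments with Gym

# An Ant Maze task from the D4RL benchmark (task id chosen for illustration).
env = gym.make("antmaze-medium-diverse-v0")

# qlearning_dataset returns the offline transitions as flat numpy arrays keyed
# by observations, actions, next_observations, rewards, and terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```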
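
Regarding the Experiment Setup row above: during online fine-tuning the paper keeps all demonstrations in an offline buffer, keeps the most recent 100k online transitions in a FIFO buffer, and draws each gradient-update minibatch as 75% online and 25% offline samples. The sketch below illustrates that sampling scheme only; the buffer data structures, batch size, and function names are assumptions, not the authors' implementation.

```python
import random
from collections import deque

ONLINE_CAPACITY = 100_000   # FIFO online buffer capacity stated in the paper
ONLINE_FRACTION = 0.75      # 75% online / 25% offline per minibatch

offline_buffer = []                              # all demonstrations, never evicted
online_buffer = deque(maxlen=ONLINE_CAPACITY)    # recent online transitions (FIFO)


def sample_minibatch(batch_size=256):
    """Draw a fine-tuning minibatch mixing online and offline transitions.

    Assumes both buffers are non-empty; the batch size is illustrative."""
    n_online = int(batch_size * ONLINE_FRACTION)
    n_offline = batch_size - n_online
    batch = random.choices(list(online_buffer), k=n_online)   # with replacement
    batch += random.choices(offline_buffer, k=n_offline)
    random.shuffle(batch)
    return batch
```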