Jump-Start Reinforcement Learning

Authors: Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show via experiments that it is able to significantly outperform existing imitation and reinforcement learning algorithms, particularly in the small-data regime. In addition, we provide an upper bound on the sample complexity of JSRL and show that with the help of a guide-policy, one can improve the sample complexity for non-optimism exploration methods from exponential in horizon to polynomial. Finally, we demonstrate that JSRL significantly outperforms previously proposed imitation and reinforcement learning approaches on a set of benchmark tasks as well as more challenging vision-based robotic problems.
Researcher Affiliation | Collaboration | (1) Google, Mountain View, California; (2) University of California, Berkeley, Berkeley, California; (3) Everyday Robots, Mountain View, California, United States; (4) Department of Statistics, University of Chicago; (5) Stanford University, Stanford, California.
Pseudocode | Yes | We provide a detailed description of JSRL in Algorithm 1. ... Algorithm 2 Jump-Start Reinforcement Learning for Episodic MDP with CB oracle. (A minimal sketch of the jump-start rollout scheme appears below the table.)
Open Source Code | Yes | A project webpage is available at https://jumpstartrl.github.io
Open Datasets | Yes | To study how JSRL compares with competitive IL+RL methods, we utilize the D4RL (Fu et al., 2020) benchmark tasks, which vary in task complexity and offline dataset quality. (A D4RL loading sketch appears below the table.)
Dataset Splits | No | The paper describes data usage for training (e.g., 'offline buffer' and 'online buffer' with sampling percentages) and mentions policy evaluation, but does not explicitly state the use of a distinct 'validation set' or 'validation split' for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper mentions simulating a '7 DoF arm' as part of the experimental setup but does not specify any computing hardware such as CPU or GPU models, or the memory used for training or running simulations.
Software Dependencies | No | The paper mentions using open-sourced implementations of IQL, AWAC, BC, and CQL, and building upon the QT-Opt algorithm, but it does not specify version numbers for these software components or for any other key libraries or programming languages used.
Experiment Setup | Yes | We consider a common setting where the agent first trains on an offline dataset (1M transitions for Ant Maze, 100k transitions for Adroit) and then runs online fine-tuning for 1M steps. For fine-tuning, we maintain two replay buffers for offline and online transitions. The offline buffer contains all the demonstrations, and the online buffer is FIFO with a fixed capacity of 100k transitions. For each gradient update during fine-tuning, we sample minibatches such that 75% of samples come from the online buffer and 25% of samples come from the offline buffer. JSRL introduces three hyperparameters: (1) the initial number of guide-steps that the guide-policy takes at the beginning of fine-tuning (H1), (2) the number of curriculum stages (n), and (3) the performance threshold that decides whether to move on to the next curriculum stage (β). (Sketches of the curriculum rollout and the two-buffer sampling appear below the table.)
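
The Pseudocode and Experiment Setup rows describe the core JSRL mechanism: a pre-trained guide-policy controls the first portion of each episode, the exploration-policy takes over for the remainder, and the number of guide-steps is annealed over a curriculum governed by H1, n, and β. Below is a minimal sketch of that rollout and curriculum loop, assuming a classic Gym-style environment and placeholder `guide_policy`, `explore_policy`, and `train_step` callables; it paraphrases Algorithm 1 rather than reproducing the authors' implementation.

```python
# Hypothetical sketch of a JSRL-style jump-start rollout and curriculum
# (paraphrased from Algorithm 1 and the quoted hyperparameters; not the
# authors' code). Assumes the classic Gym API (reset() -> obs,
# step(a) -> (obs, reward, done, info)) and three user-supplied callables:
#   guide_policy(obs)   -> action from the pre-trained guide-policy
#   explore_policy(obs) -> action from the learned exploration-policy
#   train_step(batch)   -> one update of the exploration-policy
import numpy as np


def jsrl_rollout(env, guide_policy, explore_policy, guide_steps, max_steps=1000):
    """Run one episode: the guide-policy acts for the first `guide_steps`
    steps, the exploration-policy acts afterwards."""
    obs = env.reset()
    transitions, episode_return = [], 0.0
    for t in range(max_steps):
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        episode_return += reward
        obs = next_obs
        if done:
            break
    return transitions, episode_return


def jsrl_curriculum(env, guide_policy, explore_policy, train_step,
                    h1=300, n_stages=10, beta=0.05, episodes_per_stage=100):
    """Anneal the number of guide-steps from h1 down to 0 over `n_stages`
    curriculum stages, moving to the next stage only when the mean return
    stays within `beta` of the best return seen so far."""
    schedule = np.linspace(h1, 0, n_stages + 1).astype(int)
    stage, best_return = 0, -np.inf
    while stage < len(schedule):  # a real run would also cap total env steps
        guide_steps = int(schedule[stage])
        returns = []
        for _ in range(episodes_per_stage):
            batch, ep_return = jsrl_rollout(
                env, guide_policy, explore_policy, guide_steps)
            train_step(batch)
            returns.append(ep_return)
        mean_return = float(np.mean(returns))
        best_return = max(best_return, mean_return)
        if mean_return >= best_return - beta:
            stage += 1  # performance held up: hand more of the episode over
    return explore_policy
```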
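
The Open Datasets row points to the D4RL benchmark (Fu et al., 2020). For reference, a loading sketch using the public d4rl package is shown below; the environment name is an illustrative choice and may not match the exact task versions used in the paper.

```python
# Hypothetical example of loading a D4RL benchmark dataset with the public
# d4rl package; the environment name is illustrative and version suffixes
# may differ from those used in the paper.
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL environments

env = gym.make("antmaze-umaze-v0")
dataset = d4rl.qlearning_dataset(env)  # dict of NumPy arrays

# Transitions that would populate the offline buffer described in the paper.
observations = dataset["observations"]
actions = dataset["actions"]
rewards = dataset["rewards"]
next_observations = dataset["next_observations"]
terminals = dataset["terminals"]

print(f"loaded {observations.shape[0]} offline transitions")
```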
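
The Experiment Setup row quotes a fine-tuning scheme with two replay buffers: an offline buffer holding all demonstrations and a FIFO online buffer capped at 100k transitions, with each minibatch drawn 75% from the online buffer and 25% from the offline buffer. A minimal sketch of that sampling rule, assuming simple in-memory buffers, might look like the following.

```python
# Hypothetical sketch of the two-buffer minibatch composition quoted above
# (75% online / 25% offline per gradient step); not the authors' code.
import random
from collections import deque

ONLINE_CAPACITY = 100_000                      # FIFO capacity from the quoted setup

offline_buffer = []                            # all offline demonstrations
online_buffer = deque(maxlen=ONLINE_CAPACITY)  # FIFO buffer of online transitions


def sample_minibatch(batch_size=256, online_fraction=0.75):
    """Draw `online_fraction` of the minibatch from the online buffer and the
    rest from the offline buffer (with replacement, for simplicity)."""
    n_online = int(batch_size * online_fraction)
    n_offline = batch_size - n_online
    batch = (random.choices(list(online_buffer), k=n_online)
             + random.choices(offline_buffer, k=n_offline))
    random.shuffle(batch)
    return batch
```

The batch size and sampling with replacement are illustrative choices; the quoted setup specifies only the 75%/25% split and the 100k FIFO capacity of the online buffer.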