A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Authors: Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%. Our experiments aim to investigate the following concerns: Performance: Can our method improve performance compared to existing O2O RL approaches and online RL approaches trained from scratch (see Figure 3)? N_upc: Does increasing the update frequency per collection effectively enhance performance (see Table 3)? PVU: Does the proposed Perturbed Value Update stabilize O2O training (see Table 2)? Q-value estimation: Can the proposed method effectively address Q-value estimation issues, including estimation bias and inaccurate rank (see Figure 5b)? Extension: Does our method generalize to more challenging robotic manipulation tasks (see Table 4)?
Researcher Affiliation | Collaboration | Yinmin Zhang (1,2)*, Jie Liu (2,3)*, Chuming Li (1,2)*, Yazhe Niu (2), Yaodong Yang (4), Yu Liu (2), Wanli Ouyang (2). Affiliations: (1) The University of Sydney, SenseTime Computer Vision Group, Australia; (2) Shanghai Artificial Intelligence Laboratory; (3) Multimedia Laboratory, The Chinese University of Hong Kong; (4) Institute for AI, Peking University.
Pseudocode | Yes | Algorithm 1: SO2
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Evaluation on MuJoCo tasks. Setup: SO2 and baseline O2O RL algorithms are evaluated on MuJoCo (Todorov, Erez, and Tassa 2012) tasks trained from the D4RL-v2 dataset, covering three environments (HalfCheetah, Walker2d, and Hopper), each with four datasets of different quality levels from the D4RL benchmark; this yields four pretrained policies per environment, for a total of 12 policies to average over.
Dataset Splits | No | The paper refers to 'offline training' and 'online finetuning' and evaluates performance, but it does not explicitly describe a validation dataset split or its use for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using 'DI-engine, a DRL framework', but it does not specify version numbers for DI-engine or other software dependencies required to reproduce the experiments.
Experiment Setup | Yes | Our method uses perturbation noise with σ = 0.3, c = 0.6, and N_upc = 10 as the default setup.
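For readers cross-checking these defaults, the sketch below illustrates one way a perturbed value update with noise scale σ = 0.3 and clipping bound c = 0.6 could enter the TD target, together with N_upc = 10 gradient updates per collection step. It is a minimal PyTorch sketch under assumed choices (network sizes, discount factor γ = 0.99, action range [-1, 1], and a TD3-style twin-critic target); it is not the authors' Algorithm 1 or their released implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted from the paper's default setup.
SIGMA = 0.3   # std of the perturbation noise added to the target action
CLIP_C = 0.6  # clipping range for the perturbation noise
N_UPC = 10    # gradient updates per environment collection step
GAMMA = 0.99  # discount factor (assumed; not stated in this excerpt)

# Placeholder target networks; the real method builds on an actor-critic
# backbone with twin critics, which is only approximated here.
state_dim, action_dim = 17, 6
actor_target = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                             nn.Linear(256, action_dim), nn.Tanh())
critic_target_1 = nn.Sequential(nn.Linear(state_dim + action_dim, 256),
                                nn.ReLU(), nn.Linear(256, 1))
critic_target_2 = nn.Sequential(nn.Linear(state_dim + action_dim, 256),
                                nn.ReLU(), nn.Linear(256, 1))


def perturbed_td_target(reward, next_state, done):
    """Compute a TD target with a perturbed (smoothed) value update."""
    with torch.no_grad():
        # Perturb the target action with clipped Gaussian noise.
        noise = (torch.randn(next_state.shape[0], action_dim) * SIGMA
                 ).clamp(-CLIP_C, CLIP_C)
        next_action = (actor_target(next_state) + noise).clamp(-1.0, 1.0)

        # Twin-critic minimum to curb overestimation bias.
        sa = torch.cat([next_state, next_action], dim=-1)
        target_q = torch.min(critic_target_1(sa), critic_target_2(sa))
        return reward + GAMMA * (1.0 - done) * target_q


# During online finetuning, each collection step would be followed by
# N_UPC gradient updates drawn from the replay buffer, e.g.:
#   for _ in range(N_UPC):
#       batch = replay_buffer.sample(batch_size)
#       y = perturbed_td_target(batch.reward, batch.next_state, batch.done)
#       ...regress both critics toward y, then update the actor...
```

The two knobs mirror the ablations referenced above: the noise scale and clip bound govern the Perturbed Value Update (Table 2), while N_UPC is the update-frequency-per-collection factor varied in Table 3.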