A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning
Authors: Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%. Our experiments aim to investigate the following concerns: Performance: Can our method improve performance compared to existing O2O RL approaches and online RL approaches trained from scratch (see Figure 3)? N_upc: Does increasing the update frequency per collection effectively enhance performance (see Table 3)? PVU: Does the proposed Perturbed Value Update stabilize O2O training (see Table 2)? Q-value estimation: Can the proposed method effectively address Q-value estimation issues, including estimation bias and inaccurate rank (see Figure 5b)? Extension: Does our method generalize to more challenging robotic manipulation tasks (see Table 4)? |
| Researcher Affiliation | Collaboration | Yinmin Zhang1,2*, Jie Liu2,3*, Chuming Li1,2*, Yazhe Niu2, Yaodong Yang4, Yu Liu2, Wanli Ouyang2. 1 The University of Sydney, SenseTime Computer Vision Group, Australia; 2 Shanghai Artificial Intelligence Laboratory; 3 Multimedia Laboratory, The Chinese University of Hong Kong; 4 Institute for AI, Peking University |
| Pseudocode | Yes | Algorithm 1: SO2 |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Evaluation on MuJoCo Tasks. Setup: We first evaluate SO2 and baseline O2O RL algorithms on MuJoCo (Todorov, Erez, and Tassa 2012) tasks trained from the D4RL-v2 dataset, which covers three environments (HalfCheetah, Walker2d, and Hopper), each with four datasets of different quality levels from the D4RL benchmark; each dataset yields a pretrained policy, for a total of 12 policies to average over. |
| Dataset Splits | No | The paper refers to 'offline training' and 'online finetuning' and evaluates performance, but it does not explicitly describe a 'validation' dataset split or its use for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'DI-engine, a DRL framework', but it does not specify any version numbers for DI-engine or other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | Our method uses perturbation noise with σ = 0.3 and c = 0.6, and N_upc = 10, as the default setup. |
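The defaults reported above are consistent with a TD3-style target-smoothing scheme: the Perturbed Value Update adds clipped Gaussian noise (σ = 0.3, clip c = 0.6) to the target action before computing the bootstrap target, and N_upc = 10 means ten gradient updates per environment-collection step. The sketch below is a minimal illustration under those assumptions; every name in it (`perturbed_td_target`, `finetune_step`, `target_actor`, `target_critic`, `N_UPC`) is ours rather than SO2's released code, and the action range [-1, 1] is assumed.

```python
import torch

def perturbed_td_target(target_actor, target_critic, reward, next_obs, done,
                        gamma=0.99, sigma=0.3, noise_clip=0.6):
    """Bootstrap target with clipped Gaussian noise on the target action.

    A hedged sketch of a Perturbed Value Update using the reported defaults
    sigma = 0.3 and c = 0.6; not the authors' implementation.
    """
    with torch.no_grad():
        next_action = target_actor(next_obs)
        # Sample noise, clip it to [-c, c], and perturb the target action.
        noise = (torch.randn_like(next_action) * sigma).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)  # assumes actions in [-1, 1]
        target_q = target_critic(next_obs, next_action)
        return reward + gamma * (1.0 - done) * target_q

# N_upc = 10: every collected transition is followed by 10 gradient updates.
N_UPC = 10

def finetune_step(collect_transition, gradient_update, replay_buffer):
    replay_buffer.append(collect_transition())  # one online environment step
    for _ in range(N_UPC):                      # raised update-to-data ratio
        gradient_update(replay_buffer)          # one critic/actor update on a sampled batch
```

In this framing, the N_upc ablation (Table 3) corresponds to varying `N_UPC`, and the PVU ablation (Table 2) to enabling or disabling the noise term.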