Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reinforcement Learning with Action Chunking

Authors: Qiyang Li, Zhiyuan (Paul) Zhou, Sergey Levine

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
Researcher Affiliation | Academia | Qiyang Li, Zhiyuan Zhou, Sergey Levine UC Berkeley EMAIL
Pseudocode | Yes | Algorithm 1: QC; Algorithm 2: QC-FQL; Algorithm 3: Flow ODE_Euler(s_t, z_t, f_ξ, T)
Open Source Code | Yes | Code: github.com/ColinQiyangLi/qc
Open Datasets | Yes | We consider 5 domains (5 tasks each) from OGBench [55], scene-sparse, puzzle-3x3-sparse, cube-double/triple/quadruple and 3 tasks from robomimic [43].
Dataset Splits | Yes | Oftentimes, offline-to-online RL algorithms operate in two distinct phases: an offline phase where a policy is pretrained on the offline data D and an online phase where the policy is further fine-tuned online with environment interactions. The first 1M steps are offline and the next 1M steps are online with one environment step per training step (5 seeds). For online training of RLPD, QC-RLPD, we use the same strategy where we load in a 1M-size chunk of the dataset as the offline data and perform 50/50 sampling (e.g., 50% of the data comes from the 1M-chunk of the offline data, 50% of the data comes from the online replay buffer).
Hardware Specification | Yes | We use NVIDIA RTX-A5000 GPU to run all our experiments.
Software Dependencies | No | No specific software dependencies with version numbers are explicitly mentioned in the paper. The paper refers to algorithmic components and frameworks (e.g., 'TD3+BC-style objective', 'flow-matching loss') but not specific software library versions.
Experiment Setup | Yes | Table 3: Common hyperparameters. Batch size (M): 256; Discount factor (γ): 0.99; Optimizer: Adam; Learning rate: 3 × 10^-4; Target network update rate (τ): 5 × 10^-3; Critic ensemble size (K): 10 for RLPD, RLPD-AC, QC-RLPD, and SUPE-GT, 2 for QC-FQL, FQL, FQL-n, QC-BFN, BFN, BFN-n; UTD ratio: 1; Number of flow steps (T): 10; Number of offline training steps: 10^6, except RLPD-based approaches (0); Number of online environment steps: 1 × 10^6; Network width: 512; Network depth: 4 hidden layers.
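
The 50/50 sampling described under Dataset Splits can be illustrated with a minimal sketch. This is not taken from the authors' code: the function name `sample_mixed_batch`, the list-based buffers, and sampling with replacement are all assumptions made for illustration. Only the 50/50 ratio and the batch size of 256 come from the text above.

```python
import random

BATCH_SIZE = 256  # batch size (M) from Table 3


def sample_mixed_batch(offline_data, online_buffer, batch_size=BATCH_SIZE, rng=None):
    """Hypothetical helper: build one batch with half offline, half online transitions.

    Both arguments are assumed to be non-empty sequences of transitions.
    """
    if rng is None:
        rng = random.Random()
    n_offline = batch_size // 2
    n_online = batch_size - n_offline
    # Assumption: sample with replacement, so a small online buffer
    # early in training can still fill its half of the batch.
    offline_part = rng.choices(offline_data, k=n_offline)
    online_part = rng.choices(online_buffer, k=n_online)
    batch = offline_part + online_part
    rng.shuffle(batch)  # mix the two sources within the batch
    return batch


if __name__ == "__main__":
    # Toy transitions tagged by source, standing in for real (s, a, r, s') tuples.
    offline = [("offline", i) for i in range(1000)]
    online = [("online", i) for i in range(10)]
    batch = sample_mixed_batch(offline, online, rng=random.Random(0))
    print(len(batch), sum(1 for src, _ in batch if src == "offline"))
```

With an even batch size the split is exact; with an odd one, this sketch gives the extra sample to the online half, which is an arbitrary choice.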