Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Reinforcement Learning with Action Chunking
Authors: Qiyang Li, Zhiyuan (Paul) Zhou, Sergey Levine
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks. |
| Researcher Affiliation | Academia | Qiyang Li, Zhiyuan Zhou, Sergey Levine UC Berkeley EMAIL |
| Pseudocode | Yes | Algorithm 1: QC; Algorithm 2: QC-FQL; Algorithm 3: Flow ODE_Euler(s_t, z_t, f_ξ, T) |
| Open Source Code | Yes | Code: github.com/ColinQiyangLi/qc |
| Open Datasets | Yes | We consider 5 domains (5 tasks each) from OGBench [55], scene-sparse, puzzle-3x3-sparse, cube-double/triple/quadruple and 3 tasks from robomimic [43]. |
| Dataset Splits | Yes | Oftentimes, offline-to-online RL algorithms operate in two distinct phases: an offline phase where a policy is pretrained on the offline data D and an online phase where the policy is further fine-tuned online with environment interactions. The first 1M steps are offline and the next 1M steps are online with one environment step per training step (5 seeds). For online training of RLPD, QC-RLPD, we use the same strategy where we load in a 1M-size chunk of the dataset as the offline data and perform 50/50 sampling (e.g., 50% of the data comes from the 1M-chunk of the offline data, 50% of the data comes from the online replay buffer). |
| Hardware Specification | Yes | We use NVIDIA RTX-A5000 GPU to run all our experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are explicitly mentioned in the paper. The paper refers to algorithmic components and frameworks (e.g., 'TD3+BC-style objective', 'flow-matching loss') but not specific software library versions. |
| Experiment Setup | Yes | Table 3: Common hyperparameters. Batch size (M): 256; Discount factor (γ): 0.99; Optimizer: Adam; Learning rate: 3 × 10^-4; Target network update rate (τ): 5 × 10^-3; Critic ensemble size (K): 10 for RLPD, RLPD-AC, QC-RLPD, and SUPE-GT, 2 for QC-FQL, FQL, FQL-n, QC-BFN, BFN, BFN-n; UTD ratio: 1; Number of flow steps (T): 10; Number of offline training steps: 10^6, except RLPD-based approaches (0); Number of online environment steps: 1 × 10^6; Network width: 512; Network depth: 4 hidden layers |
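
The 50/50 sampling quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `sample_mixed_batch`, the list-based buffers, and sampling with replacement are all assumptions made for illustration. Only the 50/50 split and the batch size of 256 come from the text above.

```python
import random

BATCH_SIZE = 256  # batch size (M) reported in Table 3


def sample_mixed_batch(offline_data, online_buffer, batch_size=BATCH_SIZE, rng=None):
    """Build one batch: half from offline data, half from the online replay buffer.

    Hypothetical helper. Both sources are assumed to be non-empty
    sequences of transitions.
    """
    rng = rng or random.Random()
    n_offline = batch_size // 2
    n_online = batch_size - n_offline
    # Assumption: sample with replacement, so a small online buffer
    # early in training can still fill its half of the batch.
    offline_part = rng.choices(offline_data, k=n_offline)
    online_part = rng.choices(online_buffer, k=n_online)
    batch = offline_part + online_part
    rng.shuffle(batch)  # mix the two sources within the batch
    return batch


# Toy usage with tagged placeholder transitions.
offline = [("offline", i) for i in range(1000)]
online = [("online", i) for i in range(10)]
batch = sample_mixed_batch(offline, online, rng=random.Random(0))
n_from_offline = sum(1 for source, _ in batch if source == "offline")
```

In this toy run, `n_from_offline` is exactly half the batch by construction. A real replay buffer would store arrays of states, actions, and rewards rather than tuples.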