Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning

Authors: Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with action chunk set to eight. In both real-world and simulated experiments, FiS-VLA achieves state-of-the-art (SOTA) manipulation performance. In Section 4.1, we compare the manipulation performance and inference speed of FiS-VLA with prior methods in simulated environments. The effectiveness of each component is evaluated in Section 4.2 and Appendix B. Section 4.3 presents both quantitative and qualitative results for FiS-VLA on real-world manipulation tasks, including dual-arm control under different robot configurations. Finally, in Section 4.4, we demonstrate the generalization capabilities of FiS-VLA by assessing its performance on previously unseen objects, backgrounds, and lighting conditions.
Researcher Affiliation | Collaboration | (1) The Chinese University of Hong Kong; (2) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; (3) AI2Robotics; (4) Beijing Academy of Artificial Intelligence (BAAI)
Pseudocode | No | The paper describes its methodology through textual descriptions and architectural diagrams (e.g., Figure 2: Framework of FiS-VLA), but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project web page: fast-in-slow.github.io. (Abstract) and Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We did.
Open Datasets | Yes | We curated a specialized pretraining dataset by carefully processing and filtering large-scale cross-embodiment datasets including Open X-Embodiment [19], DROID [20], ROBOMIND [32], and so on. As detailed in Appendix A, this dataset comprises over 860K trajectories. It is then fine-tuned on high-quality, self-collected real-world and simulation data [33]. Table 4 provides a comprehensive list of all datasets used in pre-training along with their corresponding sampling weights.
Dataset Splits | Yes | we construct a training dataset where each task contains 100 trajectories. and we evaluate all methods using 20 rollouts from the latest epoch checkpoint, repeating the evaluation three times for each task and reporting the average success rate along with the variance.
Hardware Specification | Yes | With a 1:4 operating frequency ratio between System 2 and System 1, FiS-VLA achieves a 117.7 Hz control frequency on an NVIDIA 4090 GPU with action chunk set to eight. FiS-VLA model is trained for 300 epochs using the AdamW optimizer [74] on 8 NVIDIA A800 GPUs, with mixed-precision training employed. The Agilex Robot is equipped with two 6-DoF arms mounted on a mobile base. ... two Orbbec DABAI cameras capture the left and right wrist views, while a RealSense 435 camera mounted overhead provides exterior-view RGB images and point cloud data.
Software Dependencies | No | The paper mentions the use of 'CoppeliaSim simulator', 'Open Motion Planning Library', and 'AdamW optimizer', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | FiS-VLA model is trained for 300 epochs using the AdamW optimizer [74] on 8 NVIDIA A800 GPUs, with mixed-precision training employed. For FiS-VLA's input, the single-view RGB image is resized to 224 × 224, the point cloud is downsampled to 1024 points, the text instruction is derived from simulation, and the robot state is aligned with the predicted actions. With a 1:4 operating frequency ratio between System 2 and System 1, FiS-VLA achieves a 117.7 Hz control frequency...
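
The evaluation protocol quoted under Dataset Splits (20 rollouts per task, repeated three times, reporting the average success rate along with the variance) could be aggregated roughly as in the sketch below. This is an illustration only, not the authors' code: the function name `summarize_success`, the boolean outcome format, the sample outcomes, and the use of population variance over the per-repeat success rates are all assumptions, since the quoted text does not say how the variance is computed.

```python
from statistics import mean, pvariance


def summarize_success(repeats):
    """Aggregate rollout outcomes for one task.

    `repeats` is a list of evaluation repeats; each repeat is a list of
    booleans (True = successful rollout). Hypothetical helper, not taken
    from the paper.
    """
    # Success rate of each repeat, e.g. 15 successes out of 20 rollouts -> 0.75.
    rates = [sum(outcomes) / len(outcomes) for outcomes in repeats]
    # Assumption: variance is taken across repeats (population variance).
    return mean(rates), pvariance(rates)


# Made-up outcomes for illustration: 3 repeats x 20 rollouts each.
repeats = [
    [True] * 15 + [False] * 5,
    [True] * 16 + [False] * 4,
    [True] * 14 + [False] * 6,
]
avg, var = summarize_success(repeats)
print(f"average success rate: {avg:.3f}, variance: {var:.5f}")
```

If the variance were instead meant as a sample variance, `statistics.variance` would be the drop-in replacement; the quoted text does not settle which one applies.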