Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification

Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5× and decreasing inference latency by 2.8× compared to OpenVLA. We conduct comprehensive evaluations of CogVLA on the LIBERO benchmark and real-world robotic manipulation tasks. Experimental results show that CogVLA achieves state-of-the-art task success rates while reducing end-to-end computational costs significantly, as shown in Fig. 1 (g) and (h). Ablation studies further validate the complementarity and synergistic effect of the routing modules and the coupled attention mechanism.
Researcher Affiliation | Academia | Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen.
Pseudocode | No | The paper describes the operations of EFA-Routing, LFP-Routing, and CAtten using text and mathematical equations (e.g., equations 5-19) and illustrates the framework in figures, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | No | Open access to data and code. Answer: [No] Justification: We train and evaluate our method using open-sourced datasets and models, which facilitates easy replication. Following acceptance, both the code and the data will be made available.
Open Datasets | Yes | Simulation Benchmark. We use the LIBERO benchmark [41] to evaluate task performance and efficiency. Its long and diverse instructions (avg. 10.48 words vs. 3.34 in RLBench) reflect the model's language understanding. LIBERO comprises four suites (Spatial, Object, Goal, and Long), each with 10 tasks and 50 demonstrations. Real-World Experiments. CogVLA is deployed on the Cobot Agilex ALOHA platform for three long-horizon tasks: Object Placement, Drawer Manipulation, and T-shirt Folding (45, 45, and 30 demonstrations). We introduce spatial and semantic variations during data collection.
Dataset Splits | No | The paper mentions that the LIBERO benchmark comprises "50 demonstrations" for each task suite and real-world experiments involved gathering "45, 45, 30, 30, and 45 expert demonstrations" for specific tasks. It also states that "We conducted 500 trials for each task suite" for evaluation. While the total numbers of demonstrations and evaluation trials are given, the paper does not explicitly specify how these demonstrations are split into training, validation, and test sets (e.g., percentages, exact counts, or reference to a standard splitting methodology with a random seed).
Hardware Specification | Yes | All experiments are conducted on 4 A800 GPUs (80GB), benefiting from CogVLA's efficient instruction-driven sparsification. Implementation details are in Appendix A.
Software Dependencies | No | The paper mentions using concepts like 'LLM embedding layer', 'transformer layer', 'MLPs', and 'Low-Rank Adaptation (LoRA)' but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA in the 'Implementation Details' (Appendix A) or any other section.
Experiment Setup | Yes | LIBERO Training Setup. We adopt OpenVLA [31] as the backbone model and set the action chunk size to K = 8. Fine-tuning is performed using Low-Rank Adaptation (LoRA) with a rank of 32 and an α value of 64. The model is trained for 60K steps with a batch size of 64 and an initial learning rate of 5e-4. Checkpoints are evaluated every 10K steps, and the best-performing checkpoint is selected for reporting. Real-World Training Setup. For the real-world experiments, we set the chunk size to K = 25 and fine-tune OpenVLA using LoRA with a rank of 32 and an alpha value of 64. The model is trained with a batch size of 32 for a total of 80K steps. The initial learning rate was set to 5e-4, which is reduced to 5e-5 after 50K steps. Starting from step 60K, we evaluate checkpoints every 10K steps and report the best-performing checkpoint.
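
The two training setups quoted in the Experiment Setup row can be written down as a small configuration sketch. This is not the authors' code (none has been released, per the Open Source Code row); the class name, field names, and helper methods are hypothetical. Three assumptions go beyond the quoted text: the LIBERO learning rate is held constant because no decay is described, the real-world learning-rate drop takes effect exactly at step 50K, and LIBERO checkpoint evaluation starts at step 10K.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    chunk_size: int                        # action chunk size K
    lora_rank: int
    lora_alpha: int
    batch_size: int
    total_steps: int
    initial_lr: float
    lr_drop_step: Optional[int] = None     # step at which the LR is reduced, if any
    lr_after_drop: Optional[float] = None
    first_eval_step: int = 10_000          # assumption for LIBERO: evaluation starts at 10K
    eval_every: int = 10_000

    def __post_init__(self) -> None:
        # A drop step without a target learning rate is a configuration error.
        if self.lr_drop_step is not None and self.lr_after_drop is None:
            raise ValueError("lr_after_drop is required when lr_drop_step is set")

    def lr_at(self, step: int) -> float:
        """Piecewise-constant schedule; assumes the drop applies from lr_drop_step onward."""
        if self.lr_drop_step is not None and step >= self.lr_drop_step:
            return self.lr_after_drop
        return self.initial_lr

    def eval_steps(self) -> list[int]:
        """Steps at which checkpoints are evaluated."""
        return list(range(self.first_eval_step, self.total_steps + 1, self.eval_every))


# LIBERO setup: K = 8, LoRA rank 32 / alpha 64, 60K steps, batch size 64, LR 5e-4.
LIBERO = FineTuneConfig(
    chunk_size=8, lora_rank=32, lora_alpha=64,
    batch_size=64, total_steps=60_000, initial_lr=5e-4,
)

# Real-world setup: K = 25, 80K steps, batch size 32, LR 5e-4 reduced to 5e-5 at 50K,
# checkpoints evaluated every 10K steps starting from step 60K.
REAL_WORLD = FineTuneConfig(
    chunk_size=25, lora_rank=32, lora_alpha=64,
    batch_size=32, total_steps=80_000, initial_lr=5e-4,
    lr_drop_step=50_000, lr_after_drop=5e-5, first_eval_step=60_000,
)

if __name__ == "__main__":
    print(LIBERO.eval_steps())
    print(REAL_WORLD.eval_steps())
    print(REAL_WORLD.lr_at(49_999), REAL_WORLD.lr_at(50_000))
```

Keeping both setups in one frozen dataclass makes the differences between them (chunk size, batch size, step count, and the learning-rate drop) easy to compare side by side.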