Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of EfficientVLA through extensive experiments on CogACT in the SIMPLER environment [21], achieving a 1.93× inference speedup and reducing FLOPs to 28.9%, all while incurring a minimal accuracy degradation of only 0.6%. |
| Researcher Affiliation | Collaboration | Yantai Yang (1,2), Yuhao Wang (1,3), Zichen Wen (1), Luo Zhongwei (1), Chang Zou (1,4), Zhipeng Zhang (1,5), Chuan Wen (1), Linfeng Zhang (1); (1) School of Artificial Intelligence, Shanghai Jiao Tong University; (2) Harbin Institute of Technology; (3) Xi'an Jiaotong University; (4) University of Electronic Science and Technology of China; (5) Anyverse Dynamics |
| Pseudocode | No | The paper describes the methodologies for layer pruning, visual token pruning, and caching intermediate features with mathematical formulations and step-by-step explanations, for example in sections 3.2, 3.3, and 3.4. However, it does not include an explicit block labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code: https://github.com/YantaiYang-05/EfficientVLA |
| Open Datasets | Yes | Our primary experimental validation of EfficientVLA is performed on CogACT [41], which integrates powerful vision encoders (DINOv2 [39] and SigLIP [40]), a Llama2-7B [14] language module for multimodal reasoning, and a Diffusion Transformer (DiT) for generating action trajectories. We conducted additional experiments on the model using the LIBERO [43] benchmark... This paper uses public datasets and models, and the specific links are given in the additional materials. |
| Dataset Splits | No | SIMPLER supports two evaluation configurations: Visual Matching, which prioritizes fidelity to real-world appearances, and Variant Aggregation, which incorporates diverse conditions such as altered lighting, backgrounds, and surface textures. For the Google robot, SIMPLER provides both evaluation settings, each featuring the same four tasks: 1) Pick coke can; 2) Move near; 3) Open/close drawer; and 4) Open top drawer and place apple. Success rate is used as the evaluation metric. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA A40 GPUs, and the inference time was measured as the average single-step inference duration. All of our experiments were conducted on a single NVIDIA A40 GPU. ... fine-tuning it for 30k steps as a baseline and performing inference on an NVIDIA 4090. |
| Software Dependencies | No | The paper mentions several models and frameworks such as CogACT, DINOv2, SigLIP, Llama2-7B, Diffusion Transformer (DiT), and PruneNet. However, it does not specify software dependencies such as the Python version, the deep learning library (e.g., PyTorch, TensorFlow) and its version, or the CUDA version. |
| Experiment Setup | Yes | For EfficientVLA, in addition to layer pruning, we further compressed the model parameters by adopting the PruneNet [44] configuration for LLM compression. Specifically, we applied a sparsity of 25% to the MLP layers of all Transformer blocks. For visual token pruning, we started from the 2nd Transformer layer with a ratio α = 50% and K_key = 4 for key task-critical tokens. Furthermore, the cache interval was set to 5. |
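
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the reported settings. The sketch below is illustrative only: the class name, field names, and the `recompute_steps` helper are hypothetical and do not come from the paper or its repository. It also assumes that a cache interval of 5 means features are recomputed at step 0 and every fifth step after that, with cached values reused in between; the quoted text does not spell out this schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EfficientVLASetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""

    mlp_sparsity: float = 0.25          # sparsity applied to MLP layers (PruneNet configuration)
    token_pruning_start_layer: int = 2  # visual token pruning starts at the 2nd Transformer layer
    token_ratio_alpha: float = 0.50     # ratio alpha as reported
    k_key: int = 4                      # K_key, the number of key task-critical tokens
    cache_interval: int = 5             # reported cache interval


def recompute_steps(num_steps: int, interval: int) -> list[int]:
    """Assumed schedule: recompute at step 0, then every `interval` steps."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [t for t in range(num_steps) if t % interval == 0]


setup = EfficientVLASetup()
print(recompute_steps(10, setup.cache_interval))  # [0, 5]
```

Under this assumed schedule, a 10-step rollout would recompute features at steps 0 and 5 and reuse cached features at the other eight steps.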