Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Authors: Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the resulting model, BridgeVLA, can learn 3D manipulation both efficiently and effectively. BridgeVLA outperforms state-of-the-art baselines across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. ... In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average.
Researcher Affiliation | Collaboration | 1 New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; 2 ByteDance Seed; 3 School of Artificial Intelligence, University of Chinese Academy of Sciences; 4 Five Ages; 5 Nanjing University
Pseudocode | No | The paper describes its methodology and processes through natural language text and figures (e.g., Fig. 1, Fig. 2) rather than explicit pseudocode blocks or algorithms.
Open Source Code | No | We will opensource the code, data and checkpoints upon acceptance.
Open Datasets | Yes | It outperforms state-of-the-art baseline methods in RLBench [19], improving the average success rate from 81.4% to 88.2%. In COLOSSEUM [35], it showcases strong performance in challenging generalization settings, boosting the success rate from 56.7% to 64.0%. In GemBench [12], it surpasses all the comparing baseline methods in terms of average success rate. ... Concretely, we leverage the 120K object detection split of RoboPoint [49] as our pre-training dataset.
Dataset Splits | Yes | Each task is provided with 100 expert demonstrations. And each demonstration is paired with language instruction and multiple keyframes. Models are evaluated via binary success rates over 25 trials per task, with a maximum of 25 action steps per trial. ... For each task, we collect 10 expert trajectories for training. ... To assess the data efficiency of BridgeVLA, we also train the model with only 3 trajectories per task.
Hardware Specification | Yes | 1. Pre-training: 8 NVIDIA A100 GPUs for 3,800 steps (~2 hours) 2. RLBench fine-tuning: 48 NVIDIA H100 GPUs for 83,000 steps (~20 hours) 3. COLOSSEUM fine-tuning: 48 NVIDIA H100 GPUs for 83,000 steps (~20 hours) 4. GemBench fine-tuning: 40 NVIDIA A100 GPUs for 50 epochs (~2.1 hours) 5. Real-world fine-tuning: 8 NVIDIA A100 GPUs for 300 epochs (~1.5 hours) For inference, we run BridgeVLA on a machine equipped with an NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions software like CoppeliaSim and models like PaliGemma, the SigLIP vision encoder, and the Gemma transformer backbone, but does not provide specific version numbers for these or other key software libraries/dependencies.
Experiment Setup | Yes | Detailed training configurations are summarized in Tab. 3. Throughout both pre-training and finetuning, we keep the SigLIP vision encoder and language token embeddings frozen. ... Table 3: Training hyperparameters for BridgeVLA: learning rate 5e-5 (Pretrain) to 2e-5 (Real-robot Finetune); optimizer AdamW; batch size 384 (Pretrain) to 192 (Finetune); warmup steps 400 (Pretrain).
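
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration record. The following is a minimal sketch, not the authors' code: the names `StageConfig`, `PRETRAIN`, `REAL_ROBOT_FINETUNE`, `warmup_lr`, and the module labels in `frozen_modules` are hypothetical, pairing batch size 192 with the real-robot stage is an assumption (the row only says "Finetune"), and the linear warmup shape is also an assumption because the excerpt gives only the number of warmup steps. Values for the intermediate fine-tuning stages are not quoted above and are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    name: str
    learning_rate: float
    batch_size: int
    # None means the excerpt does not report warmup steps for this stage.
    warmup_steps: Optional[int] = None
    optimizer: str = "AdamW"
    # Labels are illustrative; the excerpt states only that the SigLIP vision
    # encoder and the language token embeddings stay frozen.
    frozen_modules: Tuple[str, ...] = (
        "siglip_vision_encoder",
        "language_token_embeddings",
    )


# Values quoted from the Table 3 summary.
PRETRAIN = StageConfig("pretrain", learning_rate=5e-5, batch_size=384,
                       warmup_steps=400)
# Assumption: the quoted fine-tuning batch size (192) applies to this stage.
REAL_ROBOT_FINETUNE = StageConfig("real_robot_finetune", learning_rate=2e-5,
                                  batch_size=192)


def warmup_lr(cfg: StageConfig, step: int) -> float:
    """Learning rate at a 0-indexed step, assuming linear warmup."""
    if not cfg.warmup_steps or step >= cfg.warmup_steps:
        return cfg.learning_rate
    return cfg.learning_rate * (step + 1) / cfg.warmup_steps
```

Under these assumptions, `warmup_lr(PRETRAIN, step)` rises linearly over the first 400 steps and then stays at 5e-5, while `warmup_lr(REAL_ROBOT_FINETUNE, step)` returns 2e-5 at every step because no warmup is reported for that stage.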