Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Authors: Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA [1]. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
Researcher Affiliation | Collaboration | Jiahui Zhang¹, Yurui Chen¹, Yueming Xu¹, Ze Huang¹, Yanpeng Zhou², Yu-Jie Yuan², Xinyue Cai², Guowei Huang², Xingyue Quan², Hang Xu², Li Zhang¹. ¹School of Data Science, Fudan University; ²Huawei Noah's Ark Lab
Pseudocode | Yes | Algorithm 1: memory bank sampling
  Input: t, {I_{t-j} | j = 0, 1, ..., n-1}, sample size k, feature extractor ϕ
  Output: A set of sampled timestamps H
  Initialize H ← [t]  ▷ Start with the current frame
  Initialize S ← [-inf]  ▷ Similarity list
  for j = 1 to n-1 do
    s = Similarity(ϕ(I_{H[-1]}), ϕ(I_{t-j}))
    if len(H) < k then
      Append t-j to H; Append s to S
    else
      m = arg max(S)
      if s < S[m] then  ▷ Insert and reorganize
        Append t-j to H; Append s to S
        s' = Similarity(ϕ(I_{H[m-1]}), ϕ(I_{H[m+1]}))
        Remove H[m]
        Replace S[m+1] with s'; Remove S[m]
      else  ▷ Replace the last frame
        s = Similarity(ϕ(I_{H[-2]}), ϕ(I_{H[t-j]}))
        Remove H[-1]
        Append t-j to H; Append s to S
  return H
Open Source Code | Yes | https://github.com/LogosRoboticsGroup/4D-VLA
Open Datasets | Yes | DROID [2]: A diverse real-world robot manipulation dataset with 76,000 demonstration trajectories, or 350 hours of interaction data, spanning a total of 564 scenes and 86 tasks, each featuring RGB-D data from two third-person cameras and one wrist-mounted camera. LIBERO [4]: The LIBERO benchmark is a simulation suite with 4 task sets designed to advance lifelong learning in robotic manipulation. ... To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark.
Dataset Splits | Yes | Benchmark settings. We build a multi-view dataset based on LIBERO-SPATIAL [4]. For each trajectory, we sample 6 training and 6 testing viewpoints uniformly within a 270° front-facing range. Evaluation includes two tasks: In-View, where training and testing use the same views; and Cross-View, where testing is done on unseen viewpoints.
Hardware Specification | Yes | Training was conducted on 8 NVIDIA A6000 GPUs over 96 hours. Inference with Flash Attention in bf16 requires approximately 8 GB of GPU memory.
Software Dependencies | No | Our pretrained model is based on InternVL-4B. We leverage a pretrained large vision-language model (VLM) as the backbone, specifically InternVL-4B [12], which consists of a text tokenizer T, a vision encoder E, and a Transformer decoder D.
Experiment Setup | Yes | Our pretrained model is based on InternVL-4B. We use a temporal window of 20 and apply memory bank sampling to select 5 past frames along with the current frame. In the training process, we freeze the vision encoder but train all other parameters. λd is set to 1. We utilize a cosine learning rate scheduler with an initial learning rate of 2e-5. Our model was trained for 1 epoch with a batch size of 512, requiring around 20k steps to complete.
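
The memory bank sampling procedure quoted in the Pseudocode row can be sketched in Python as follows. This is an illustrative reading of the extracted pseudocode, not the authors' implementation, and it rests on several assumptions: frames are passed as a dictionary keyed by timestamp, phi is a hypothetical feature extractor supplied by the caller, Similarity is taken to be cosine similarity, and k is at least 2. In the "replace the last frame" branch, the sketch compares the second-to-last kept frame with the new frame and also drops the stale last similarity so that S stays aligned with H; the quoted pseudocode only mentions appending, so that step is an interpretation.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (assumed Similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def memory_bank_sampling(
    t: int,
    frames: Dict[int, object],
    n: int,
    k: int,
    phi: Callable[[object], Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[int]:
    """Select up to k timestamps from the frames t, t-1, ..., t-(n-1).

    H holds the kept timestamps (newest first). S[i] holds the similarity
    between the frames at H[i-1] and H[i]; S[0] is a -inf placeholder.
    """
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")

    def feat(ts: int) -> Vector:
        return phi(frames[ts])

    H = [t]            # start with the current frame
    S = [-math.inf]    # similarity list, aligned with H

    for j in range(1, n):
        cand = t - j
        s = similarity(feat(H[-1]), feat(cand))
        if len(H) < k:
            # Bank not full yet: keep the candidate.
            H.append(cand)
            S.append(s)
            continue

        # Index of the most redundant kept frame (highest neighbor similarity).
        m = max(range(len(S)), key=S.__getitem__)
        if s < S[m]:
            # Insert the candidate, then drop H[m] and reconnect its neighbors.
            H.append(cand)
            S.append(s)
            s_merged = similarity(feat(H[m - 1]), feat(H[m + 1]))
            del H[m]
            S[m + 1] = s_merged
            del S[m]
        else:
            # Candidate is the most redundant: it replaces the last kept frame.
            s = similarity(feat(H[-2]), feat(cand))
            H.pop()
            S.pop()  # interpretation: keep S the same length as H
            H.append(cand)
            S.append(s)
    return H
```

Under the settings quoted in the Experiment Setup row (a temporal window of 20, with 5 past frames kept along with the current frame), a call might plausibly use n = 20 and k = 6, although that mapping of the reported numbers onto the algorithm's inputs is itself an assumption.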