Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Authors: Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental evaluation provides an extensive analysis of the various modeling choices in continuous-action VLAs, building on the π0 model architecture [7]. We evaluate on complex, long-horizon robotic manipulation tasks, including mobile bimanual robots, as well as open-source benchmarks such as DROID and LIBERO.
Researcher Affiliation | Industry | Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine (Physical Intelligence)
Pseudocode | No | The paper describes methods with mathematical equations and textual explanations, but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Videos are available at https://pi.website/research/knowledge_insulation and open-source model weights are available at https://github.com/Physical-Intelligence/openpi.
Open Datasets | Yes | We evaluate our method on the simulated LIBERO [31] benchmark, as well as the real world DROID [23] benchmark. ... We also include the open-source OXE dataset [37]. We also train the generalist model with a variety of general VLM tasks. The data involves image captioning (CapsFusion [53], COCO [10]), visual-question-answering (Cambrian-7M [47], PixMo [13], VQAv2 [19]), as well as object localization.
Dataset Splits | No | The paper states that
Hardware Specification | Yes | The inference time of π0-FAST for predicting a 1-second action chunk is 750 ms on an RTX4090 GPU [38]
Software Dependencies | No | The paper mentions using the PaliGemma VLM [4] architecture but does not specify version numbers for other key software components like libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We use the PaliGemma VLM [4] architecture as the VLM backbone and initialize it with its pretrained weights. The action expert is a smaller transformer that takes in a sequence of noisy actions $a^{\tau,\omega}_{1:H}$ for an action horizon of 50, i.e. $H = 50$. ... The dimensions of the VLM backbone and action expert are as follows: {width=2048, depth=18, mlp_dim=16,384, num_heads=18, num_kv_heads=1, head_dim=256} for the 2B language model backbone, and the same except for {width=1024, mlp_dim=4096} for the action expert, leading to 300M parameters. ... We follow π0 for sampling the flow-matching timestep τ. ... given by $p(\tau) = \mathrm{Beta}\left(\frac{s - \tau}{s};\ \alpha = 1.5,\ \beta = 1\right)$, $s = 0.999$.
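
The timestep sampling quoted in the Experiment Setup row can be illustrated with a short standard-library Python sketch. It reads the quoted density as a shifted Beta distribution, meaning that (s − τ)/s follows Beta(1.5, 1), so τ = s(1 − u) for u drawn from Beta(1.5, 1). The noisy-action construction (a linear blend of the clean actions and Gaussian noise ω) is an assumption based on the usual flow-matching convention and is not stated in the row above. The function names, the 7-dimensional action vector, and the all-zero action chunk are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random

S = 0.999               # cutoff s quoted in the Experiment Setup row
ALPHA, BETA = 1.5, 1.0  # Beta parameters quoted in the same row
HORIZON = 50            # action horizon H

def sample_tau(rng):
    """Draw tau in [0, S] such that (S - tau) / S ~ Beta(ALPHA, BETA)."""
    u = rng.betavariate(ALPHA, BETA)
    return S * (1.0 - u)

def noisy_action_chunk(actions, tau, rng):
    """Blend clean actions with Gaussian noise omega.

    ASSUMPTION: linear path tau * a + (1 - tau) * omega, the common
    flow-matching convention; the quoted row does not spell this out.
    """
    return [
        [tau * a + (1.0 - tau) * rng.gauss(0.0, 1.0) for a in step]
        for step in actions
    ]

rng = random.Random(0)
clean = [[0.0] * 7 for _ in range(HORIZON)]  # hypothetical 7-dim action chunk
tau = sample_tau(rng)
noisy = noisy_action_chunk(clean, tau, rng)
```

Because Beta(1.5, 1) places more mass near 1, this reading concentrates τ near 0, which under the assumed blend corresponds to the noisier end of the path.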