Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Authors: Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation provides an extensive analysis of the various modeling choices in continuous-action VLAs, building on the π0 model architecture [7]. We evaluate on complex, long-horizon robotic manipulation tasks, including mobile bimanual robots, as well as open-source benchmarks such as DROID and LIBERO. |
| Researcher Affiliation | Industry | Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine (Physical Intelligence) |
| Pseudocode | No | The paper describes methods with mathematical equations and textual explanations, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Videos are available at https://pi.website/research/knowledge_insulation and open-source model weights are available at https://github.com/Physical-Intelligence/openpi. |
| Open Datasets | Yes | We evaluate our method on the simulated LIBERO [31] benchmark, as well as the real world DROID [23] benchmark. ... We also include the open-source OXE dataset [37]. We also train the generalist model with a variety of general VLM tasks. The data involves image captioning (CapsFusion [53], COCO [10]), visual-question-answering (Cambrian-7M [47], PixMo [13], VQAv2 [19]), as well as object localization. |
| Dataset Splits | No | The paper states that |
| Hardware Specification | Yes | The inference time of π0-FAST for predicting a 1-second action chunk is 750 ms on an RTX4090 GPU [38] |
| Software Dependencies | No | The paper mentions using the PaliGemma VLM [4] architecture but does not specify version numbers for other key software components like libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We use the PaliGemma VLM [4] architecture as the VLM backbone and initialize it with its pretrained weights. The action expert is a smaller transformer that takes in a sequence of noisy actions a^{τ,ω}_{1:H} for an action horizon of 50, i.e. H = 50. ... The dimensions of the VLM backbone and action expert are as follows: {width=2048, depth=18, mlp_dim=16,384, num_heads=18, num_kv_heads=1, head_dim=256} for the 2B language model backbone, and the same except for {width=1024, mlp_dim=4096} for the action expert, leading to 300M parameters. ... We follow π0 for sampling the flow-matching timestep τ. ... given by p(τ) = Beta((s − τ)/s; α = 1.5, β = 1), s = 0.999. |
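
The timestep distribution quoted in the Experiment Setup row can be illustrated with a short sketch. It assumes the argument of the Beta density is (s − τ)/s, which is a reading of the quoted formula rather than something this page confirms, so that drawing u from Beta(1.5, 1) and setting τ = s · (1 − u) gives a sample in [0, s] with more mass on small τ. The function name `sample_flow_timestep` is hypothetical and is not taken from the paper or from openpi.

```python
import random


def sample_flow_timestep(rng, s=0.999, alpha=1.5, beta=1.0):
    """Draw one timestep tau, assuming (s - tau) / s ~ Beta(alpha, beta)."""
    u = rng.betavariate(alpha, beta)  # u lies in [0, 1]
    return s * (1.0 - u)              # invert u = (s - tau) / s


# Seeded generator so the sketch is repeatable.
rng = random.Random(0)
taus = [sample_flow_timestep(rng) for _ in range(100_000)]
mean_tau = sum(taus) / len(taus)

# Under this assumption E[u] = alpha / (alpha + beta) = 0.6,
# so the mean of tau should sit near s * 0.4.
print(f"min={min(taus):.4f} max={max(taus):.4f} mean={mean_tau:.4f}")
```

Under this reading no sample exceeds s = 0.999, so the cutoff keeps τ strictly below 1.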