Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Authors: Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, Florian Shkurti

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test it on OpenVLA, π0, and π0-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and a favorable trade-off between accuracy and detection time using conformal prediction. We conduct failure detection experiments on OpenVLA [2], π0 [4] and π0-FAST [5], in both simulation and the real world.
Researcher Affiliation	Collaboration	1University of Toronto (UofT), 2UofT Robotics Institute, 3Vector Institute, 4Toyota Research Institute (TRI) EMAIL
Pseudocode	No	The paper describes the SAFE method in Section 4.2 ('Failure Detection by Feature Probing') with textual descriptions of its components (feature extraction, failure score predictor with MLP or LSTM backbones) and training losses. However, it does not present this information in a structured pseudocode or algorithm block.
Open Source Code	Yes	More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/. The source code for this paper has been released at https://github.com/vla-safe/SAFE.
Open Datasets	Yes	LIBERO [56]: The LIBERO benchmark has been widely adopted for evaluating VLA models in simulation [2, 4–6]. ... SimplerEnv [63]: SimplerEnv provides a high-fidelity simulation environment for manipulation policies... We adopt the model checkpoints that are finetuned on the LIBERO benchmark and released by their authors. Real-world Franka Experiments: We deploy the π0-FAST-DROID checkpoint [4, 5] on a Franka Emika Panda Robot. This checkpoint has been finetuned on the DROID dataset [32]... Real-world WidowX Experiments: We also deploy the OpenVLA model pretrained on the Open-X Magic Soup++ dataset [2] on a WidowX robot manipulator in our lab.
Dataset Splits	Yes	In experiments, we split all tasks into seen and unseen subsets, where rollouts from seen tasks are used for training D_train and validation D_eval-seen, and all rollouts from unseen tasks D_eval-unseen are reserved for testing the cross-task generalization ability of failure detectors. For evaluation, 3 out of 10 tasks are unseen, and within seen tasks, 60% of rollouts are used for D_train and the remaining 40% for D_eval-seen. Within each embodiment, 1 out of 4 tasks is unseen, and within the seen tasks, 66% of the rollouts are in D_train and the remaining 33% in D_eval-seen. In experiments, 3 tasks out of 13 are randomly selected as unseen tasks. In this experiment, 2 tasks out of 8 are randomly selected as unseen tasks.
Hardware Specification	Yes	All training and evaluation are done on a single NVIDIA A100 40GB GPU.
Software Dependencies	No	The paper mentions "JAX: composable transformations of Python+NumPy programs" [67] and "Adam optimizer [74]" as components or algorithms used. However, it does not specify concrete version numbers for JAX, Python, or any other software libraries, which is required for a reproducible description of ancillary software.
Experiment Setup	Yes	The SAFE models are trained for 1000 epochs with batch size 512. We use Adam optimizer [74] with β1 = 0.9, β2 = 0.999, ϵ = 10^-8, and a learning rate (lr) determined by grid search. We also apply an L2 regularization loss on the model weights to reduce overfitting, and this loss is weighted by λreg and optimized together with the failure score learning loss L_LSTM or L_MLP. λreg are determined by grid search.
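
The task-level split quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_rollouts`, the rollout identifiers, the seed, and the 50 rollouts per task are all hypothetical. Only the proportions (3 of 10 tasks unseen, 60% of seen-task rollouts for training and 40% for validation) come from the quoted text.

```python
import random


def split_rollouts(rollouts_by_task, n_unseen, train_frac, seed=0):
    """Split rollouts at the task level into train, eval-seen and eval-unseen.

    Hypothetical helper for illustration; not taken from the SAFE code base.
    """
    rng = random.Random(seed)
    tasks = sorted(rollouts_by_task)
    # Randomly hold out whole tasks; none of their rollouts are used for training.
    unseen = set(rng.sample(tasks, n_unseen))

    d_train, d_eval_seen, d_eval_unseen = [], [], []
    for task in tasks:
        rollouts = list(rollouts_by_task[task])
        if task in unseen:
            d_eval_unseen.extend(rollouts)
            continue
        # Within a seen task, shuffle and split rollouts by the given fraction.
        rng.shuffle(rollouts)
        n_train = round(train_frac * len(rollouts))
        d_train.extend(rollouts[:n_train])
        d_eval_seen.extend(rollouts[n_train:])
    return d_train, d_eval_seen, d_eval_unseen


# Assumed toy data: 10 tasks with 50 rollouts each, identified by (task, index).
rollouts_by_task = {
    f"task_{i}": [(f"task_{i}", j) for j in range(50)] for i in range(10)
}
d_train, d_eval_seen, d_eval_unseen = split_rollouts(
    rollouts_by_task, n_unseen=3, train_frac=0.6
)
print(len(d_train), len(d_eval_seen), len(d_eval_unseen))
```

Holding out entire tasks, rather than individual rollouts, is what lets the unseen subset measure cross-task generalization: no rollout from a held-out task can leak into training.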