Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Failure Prediction at Runtime for Generative Robot Policies

Authors: Ralf Römer, Adrian Kobras, Luca Worbis, Angela Schoellig

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data, and videos are available at tum-lsy.github.io/fiper_website. 5 Experiments
Researcher Affiliation | Academia | Ralf Römer¹, Adrian Kobras¹, Luca Worbis¹, Angela P. Schoellig¹,²,³; ¹ Technical University of Munich, Germany; Learning Systems and Robotics Lab; Munich Institute of Robotics and Machine Intelligence (MIRMI); ² Robotics Institute Germany; ³ Munich Center for Machine Learning. EMAIL
Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | Code, data, and videos are available at tum-lsy.github.io/fiper_website.
Open Datasets | Yes | Code, data, and videos are available at tum-lsy.github.io/fiper_website. For PUSHT and PUSHCHAIR, we use the publicly available rollout datasets from Agia et al. [1], which are released under an MIT License.
Dataset Splits | Yes | We use M = 50 successful rollouts for the three simulation environments and M = 10 for the two real-world tasks to train the learning-based failure predictors and calibrate the thresholds. Table 4: # Calibration rollouts: 50 (SORTING, STACKING, PUSHT), 10 (PRETZEL, PUSHCHAIR); # Test rollouts: 400 (SORTING), 800 (STACKING), 300 (PUSHT), 20 (PRETZEL, PUSHCHAIR); # Test rollouts (ID): 100 (SORTING), 200 (STACKING), 150 (PUSHT), 0 (PRETZEL, PUSHCHAIR); # Test rollouts (OOD): 300 (SORTING), 600 (STACKING), 150 (PUSHT), 20 (PRETZEL, PUSHCHAIR).
Hardware Specification | Yes | We conduct all experiments on a workstation with 64 GB of RAM, an NVIDIA GeForce RTX 4090 GPU, and an Intel Core i9-285K CPU.
Software Dependencies | No | The paper mentions various models and architectures (e.g., U-Net, transformer, ResNet-18) and optimizers (AdamW), but does not provide specific version numbers for software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Table 5: Implementation details for our generative IL policies based on flow matching (FM) [40] and denoising diffusion probabilistic models (DDPM) [25]. The policy hyperparameters for PushT and PushChair are taken from Agia et al. [1]. It includes: Observation history length, Action chunk length, Action execution steps, Action batch size, Integration steps, Training epochs, Training batch size, Optimizer, Learning rate. Table 3: Training parameters of the RND-OE model. It includes: Batch size, Epochs, Learning rate, Learning rate scheduler, Optimizer, Optimizer weight decay, Optimizer epsilon, Train/validation split.
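
The rollout counts quoted in the Dataset Splits row can be checked for internal consistency. The sketch below is not taken from the paper's code; it only transcribes the Table 4 numbers quoted above. The dictionary layout and the helper names `inconsistent_splits` and `ood_fraction` are illustrative assumptions.

```python
# Rollout counts transcribed from the Dataset Splits row (Table 4 of the paper).
# The dictionary layout and helper names are illustrative assumptions,
# not part of the authors' released code.
SPLITS = {
    "SORTING":   {"calibration": 50, "test": 400, "test_id": 100, "test_ood": 300},
    "STACKING":  {"calibration": 50, "test": 800, "test_id": 200, "test_ood": 600},
    "PUSHT":     {"calibration": 50, "test": 300, "test_id": 150, "test_ood": 150},
    "PRETZEL":   {"calibration": 10, "test": 20,  "test_id": 0,   "test_ood": 20},
    "PUSHCHAIR": {"calibration": 10, "test": 20,  "test_id": 0,   "test_ood": 20},
}


def inconsistent_splits(splits):
    """Return environments whose ID + OOD test rollouts do not sum to the test total."""
    return [
        name
        for name, s in splits.items()
        if s["test_id"] + s["test_ood"] != s["test"]
    ]


def ood_fraction(splits):
    """Return the share of out-of-distribution rollouts in each test set."""
    return {name: s["test_ood"] / s["test"] for name, s in splits.items()}


if __name__ == "__main__":
    print("Inconsistent environments:", inconsistent_splits(SPLITS))
    for name, frac in ood_fraction(SPLITS).items():
        print(f"{name}: {frac:.2f} of test rollouts are OOD")
```

For the quoted numbers, the in-distribution and OOD counts add up to the test total in every environment, and the two environments with 10 calibration rollouts (PRETZEL, PUSHCHAIR) are tested on OOD rollouts only.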