Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Reliably detecting model failures in deployment without labels
Authors: Viet Nguyen, Changjian Shui, Vijay Giri, Siddharth Arya, Amol Verma, Fahad Razak, Rahul G Krishnan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on both standard benchmark and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines. We showcase experimental results on various shift scenarios in the UCI Heart Disease dataset [21], CIFAR-10/10.1 [2], Camelyon17 (WILDS) [4], and the GEneral Medicine INpatient Initiative (GEMINI) dataset [22, 23] to demonstrate its effectiveness in monitoring models of various modalities. |
| Researcher Affiliation | Academia | 1University of Toronto 2University of Ottawa 3Vector Institute 4University of Pennsylvania 5Unity Health Toronto |
| Pseudocode | Yes | We defer the presentation and discussion of our theoretical analysis to Appendix A. Algorithm 1 Idealized D3M: Calibrate. Algorithm 2 Idealized D3M: Deploy. |
| Open Source Code | Yes | A detailed implementation can be found here: https://github.com/teivng/d3m. Our code is structured as a package and is entirely modular. We report hyperparameters and the random seed schedule used to reproduce our reported results. In addition, we provide boilerplate code in IPython notebooks for users looking to familiarize with our system and even build upon it. |
| Open Datasets | Yes | We showcase experimental results on various shift scenarios in the UCI Heart Disease dataset [21], CIFAR-10/10.1 [2], Camelyon17 (WILDS) [4], and the GEneral Medicine INpatient Initiative (GEMINI) dataset [22, 23] to demonstrate its effectiveness in monitoring models of various modalities. |
| Dataset Splits | Yes | For all experiments, the significance level α is fixed to 0.10. (2) For UCI Heart Disease, CIFAR-10/10.1, and Camelyon17, where there are known post-deployment deterioration, we evaluate the baselines and D3M's ability to monitor shift for query sizes {10, 20, 50} of the deployment distribution. The temporal shift analysis splits data into half-years 2018H1, 2019H2, etc. The baseline model uses 2017H1 and prior data for training, and 2017H2 for validation; Tab. 3 shows patient statistics for this split. The different age groups are created by splitting the data into 5 equally sized groups based on ages of patients: (1) 18-52, (2) 52-66, (3) 66-72, (4) 76-85, (5) 85+; Tab. 4 shows patient statistics for this split. |
| Hardware Specification | Yes | All experiments were run on High Performance Computing (HPC) clusters. UCI Heart Disease experiments were run on GPU nodes with at minimum 8GB of GPU memory, 6 CPU cores, and 8GB RAM. CIFAR-10/10.1 experiments were run on GPU nodes with at minimum 24GB of GPU memory to accommodate the largest configurations of convolutions, 12 CPU cores, and 12GB RAM. Camelyon17 experiments were run on GPU nodes with at minimum 80GB of GPU memory to accommodate the largest ResNets during sweeping, 12 CPU cores, and 12GB RAM. GEMINI Dataset. Experiments on the GEMINI dataset were run on GPU and CPU nodes. We request at minimum 16GB of GPU memory (when applicable), 9 CPU cores, and 32GB of RAM. |
| Software Dependencies | No | All optimization is done using AdamW [68]. We borrow the implementation from [25] which can be readily coupled with the above neural feature extractors for end-to-end ELBO maximization. Finally, we use scikit-learn: Machine learning in Python [69]. |
| Experiment Setup | Yes | We report used hyperparameters in Tables 5 and 6 for transparency and reproducibility. In CIFAR-10/10.1, Hidden dimension refers to the dimensionality of the final output of FEθ, and test size m is identical for D3M and other baseline algorithms for each set of experiments. Tables 5 and 6 detail hyperparameters such as Learning rate, Batch size, Epochs, Weight decay, Hidden dimension, Num. hidden layers, Dropout, Regularization factor, Prior scale, Wishart scale, Sampling temperature, and Test size m. |
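
The temporal split quoted under Dataset Splits (2017H1 and earlier for training, 2017H2 for validation) can be illustrated with a minimal sketch. This is not the authors' code: the function names `half_year_label` and `assign_split`, the "deployment" label for later periods, and the example dates are assumptions made for illustration only.

```python
from datetime import date


def half_year_label(d: date) -> str:
    """Return a label such as '2017H1' (Jan-Jun) or '2017H2' (Jul-Dec)."""
    return f"{d.year}H{1 if d.month <= 6 else 2}"


def assign_split(d: date) -> str:
    """Map a date to 'train', 'validation', or 'deployment'.

    Assumed rule, following the quoted split: 2017H1 and earlier is
    training data, 2017H2 is validation data, and anything later is
    treated here as deployment data (the last label is our assumption).
    """
    key = (d.year, 1 if d.month <= 6 else 2)
    if key <= (2017, 1):
        return "train"
    if key == (2017, 2):
        return "validation"
    return "deployment"


# Hypothetical dates, for illustration only.
examples = [date(2016, 11, 3), date(2017, 6, 30), date(2017, 7, 1), date(2018, 2, 14)]
labels = [(half_year_label(d), assign_split(d)) for d in examples]
```

Comparing `(year, half)` tuples keeps the boundary logic explicit: a record dated 30 June 2017 falls in training, while one dated 1 July 2017 falls in validation.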