Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Uncertainty Quantification with the Empirical Neural Tangent Kernel

Authors: Joseph Wilson, Chris van der Heide, Liam Hodgkinson, Fred Roosta

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a series of numerical experiments, we show that our method not only outperforms competing approaches in computational efficiency, often reducing costs by multiple factors, but also maintains state-of-the-art performance across a variety of UQ metrics for both regression and classification tasks.
Researcher Affiliation | Academia | Joseph Wilson (School of Mathematics and Physics, University of Queensland); Chris van der Heide (Dept. of Electrical and Electronic Engineering, University of Melbourne); Liam Hodgkinson (School of Mathematics and Statistics, University of Melbourne); Fred Roosta (CIRES and School of Mathematics and Physics, University of Queensland)
Pseudocode | Yes | Algorithm 1 NUQLS. Input: number of realizations S, weights θ̂. for s = 1 to S do: θ_{0,s} ← θ̂ + z_0, where z_0 ∼ N(0, γ²I); θ_s ← run (stochastic) GD from θ_{0,s} to (approximately) solve (3) and obtain θ_s; end for. return {f̃(θ_s, ·)}_{s=1}^{S}
Open Source Code | Yes | The PyTorch implementation of our experiments is available here. We have also released our method as a package.
Open Datasets | Yes | In Tables 1 and 9, we compare NUQLS with DE, LLA and SWAG on a series of UCI regression problems. ... Figure 3 presents a violin plot of the VMSP for three test-groups: correctly predicted in-distribution (Fashion MNIST, CIFAR-10) test points, incorrectly predicted in-distribution test points, and out-of-distribution (MNIST, CIFAR-100) test points. ... In the top right of Figure 6, we evaluated NUQLS on the ImageNet dataset...
Dataset Splits | Yes | For each dataset, we ran a number of experiments to get a mean and standard deviation for performance metrics. In each experiment, we took a random 70%/15%/15% split of the dataset for training, testing, and validation. ... For MNIST and Fashion MNIST, we took a 5:1 training/validation split of the training data.
Hardware Specification | Yes | All experiments were run either on an Intel i7-12700 CPU (toy regression), or on an H100 80GB GPU (UCI regression and image classification).
Software Dependencies | No | The PyTorch implementation of our experiments is available here. ... The variational inference method used is Bayes By Backprop (Blundell et al., 2015), as deployed in the Bayesian Torch package (Krishnan et al., 2022). ... We employed the Lightning UQ Box (Lehmann et al., 2025) implementation of SNGP for our experiments.
Experiment Setup | Yes | The training hyper-parameters for the MAP, DE and NUQLS networks, size of the MLP used, and the number of experiments conducted for each dataset can be found in Table 12. ... We display the training procedure in Table 13 for both Figure 3 and Table 2.
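To make the loop in Algorithm 1 (quoted in the Pseudocode row) concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not stated on this page: that objective (3) is a mean squared error on the network linearized around the trained weights, that a two-parameter toy model can stand in for the network, and that full-batch gradient descent is an acceptable stand-in for the (stochastic) GD step. The names `f`, `jac`, `f_lin`, `nuqls_sketch` and `predict` are hypothetical and do not come from the authors' package.

```python
import math
import random


def f(theta, x):
    """Toy stand-in for a trained network: theta[1] * tanh(theta[0] * x)."""
    return theta[1] * math.tanh(theta[0] * x)


def jac(theta, x):
    """Gradient of f with respect to theta, evaluated at input x."""
    t = math.tanh(theta[0] * x)
    return [theta[1] * (1.0 - t * t) * x, t]


def f_lin(theta, theta_hat, x):
    """First-order expansion of f around theta_hat, evaluated at theta."""
    j = jac(theta_hat, x)
    shift = sum(ji * (ti - hi) for ji, ti, hi in zip(j, theta, theta_hat))
    return f(theta_hat, x) + shift


def nuqls_sketch(theta_hat, xs, ys, S=10, gamma=0.1, lr=0.05, steps=500, seed=0):
    """Return S parameter vectors, one per perturbed initialization.

    ASSUMPTION: objective (3) is taken to be the mean squared error of the
    linearized model; the paper's exact objective is not shown on this page.
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(S):
        # theta_{0,s} = theta_hat + z_0 with z_0 ~ N(0, gamma^2 I)
        theta = [h + rng.gauss(0.0, gamma) for h in theta_hat]
        for _ in range(steps):
            grad = [0.0] * len(theta)
            for x, y in zip(xs, ys):
                residual = f_lin(theta, theta_hat, x) - y
                j = jac(theta_hat, x)
                for i in range(len(theta)):
                    grad[i] += residual * j[i] / len(xs)
            # Plain full-batch gradient descent step
            theta = [t - lr * g for t, g in zip(theta, grad)]
        ensemble.append(theta)
    return ensemble


def predict(ensemble, theta_hat, x):
    """Mean and (population) variance of the linearized predictions at x."""
    preds = [f_lin(theta, theta_hat, x) for theta in ensemble]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var
```

Calling `nuqls_sketch([1.0, 1.0], xs, ys)` on a small regression set and then `predict(ensemble, [1.0, 1.0], x_new)` gives an ensemble mean and a variance that can be read as an uncertainty estimate at `x_new`. The spread comes only from the random initializations, mirroring the structure of the quoted algorithm.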
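The random 70%/15%/15% training/testing/validation split described in the Dataset Splits row can be illustrated with a short, standard-library-only sketch. The function name `split_indices`, the seeding scheme, and the choice to give any rounding remainder to the validation set are assumptions made for illustration; the authors' code may differ on all three.

```python
import random


def split_indices(n, train_pct=70, test_pct=15, seed=0):
    """Shuffle range(n) and cut it into train/test/validation index lists.

    ASSUMPTION: the remainder after the train and test cuts goes to validation.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * train_pct // 100
    n_test = n * test_pct // 100
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]
    return train, test, val
```

Repeating the call with different `seed` values yields the independent random splits needed to report a mean and standard deviation across experiments.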