Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rethinking Approximate Gaussian Inference in Classification

Authors: Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at github.com/bmucsanyi/probit. 7 Experiments We now investigate our two research questions: (i) Do we have to sacrifice performance for sample-free predictives? (Section 7.1) (ii) What are the effects of changing the learning objective? (Section 7.2)
Researcher Affiliation	Academia	Bálint Mucsányi; Nathaël Da Costa; Tübingen AI Center, University of Tübingen; Philipp Hennig
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at github.com/bmucsanyi/probit.
Open Datasets	Yes	We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at github.com/bmucsanyi/probit.
Dataset Splits	Yes	For ImageNet evaluation, we use a ResNet-50 backbone from the timm library [53] pretrained with the softmax activation function, and train each (method, activation) pair for 50 ImageNet-1k epochs following Mucsányi et al. [40]. For Vision Transformer [14] experiments, refer to Appendix K.2. On CIFAR-10, we train ResNet-28 models from scratch for 100 epochs. The CIFAR-100 experiments train Wide ResNet-28-5 models for 200 epochs. We search for ideal hyperparameters with a ten-step Bayesian Optimization scheme [47] in Weights & Biases [5]. During training, we keep track of the best-performing checkpoint on the validation set and load it before testing.
Hardware Specification	Yes	The hyperparameter optimization, training, and evaluation of the methods used in this paper took 0.8 GPU years on NVIDIA RTX 2080 Ti GPUs in a university compute cluster. The individual runs required no more than 50 GB of RAM and 3 days of runtime. We measure the time to obtain the predictives from the logit-space Gaussians compared to the time of a forward pass on an NVIDIA RTX 2080 Ti GPU (with similar results on Tesla A100 GPUs).
Software Dependencies	No	For ImageNet evaluation, we use a ResNet-50 backbone from the timm library [53] pretrained with the softmax activation function, and train each (method, activation) pair for 50 ImageNet-1k epochs following Mucsányi et al. [40]. We train with the LAMB optimiser [57] using a batch size of 128 and gradient accumulation across 16 batches, resulting in an effective batch size of 2048, following Tran et al. [52]. We further use a cosine learning rate schedule with a single warmup epoch using a warmup learning rate of 0.0001. The learning rate is treated as a hyperparameter and selected from the interval [0.0005, 0.05] based on the validation performance. The weight decay is selected from the set {0.01, 0.02}. During training, we keep track of the best-performing checkpoint on the validation set and load it before testing. We search for ideal hyperparameters with a ten-step Bayesian Optimization scheme [47] in Weights & Biases [5] based on the negative log-likelihood. The paper mentions several software components (timm library, LAMB optimizer, Weights & Biases) but does not provide specific version numbers for them.
Experiment Setup	Yes	For ImageNet evaluation, we use a ResNet-50 backbone from the timm library [53] pretrained with the softmax activation function, and train each (method, activation) pair for 50 ImageNet-1k epochs following Mucsányi et al. [40]. We train with the LAMB optimiser [57] using a batch size of 128 and gradient accumulation across 16 batches, resulting in an effective batch size of 2048, following Tran et al. [52]. We further use a cosine learning rate schedule with a single warmup epoch using a warmup learning rate of 0.0001. The learning rate is treated as a hyperparameter and selected from the interval [0.0005, 0.05] based on the validation performance. The weight decay is selected from the set {0.01, 0.02}. During training, we keep track of the best-performing checkpoint on the validation set and load it before testing. We search for ideal hyperparameters with a ten-step Bayesian Optimization scheme [47] in Weights & Biases [5] based on the negative log-likelihood.
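
Illustrative sketch (not taken from the paper or its repository): the Experiment Setup row quotes a batch size of 128 with gradient accumulation across 16 batches, 50 epochs, a single warmup epoch, and a warmup learning rate of 0.0001. The Python below shows how those numbers could combine into an effective batch size and a per-step cosine schedule. Several details are assumptions made only for illustration: the peak learning rate (any value in the quoted interval [0.0005, 0.05] would do), the number of optimizer steps per epoch, a linear ramp from the warmup learning rate to the peak, a final learning rate of zero, and per-step rather than per-epoch updates. The function names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
BATCH_SIZE = 128
ACCUMULATION_STEPS = 16
EPOCHS = 50
WARMUP_EPOCHS = 1
WARMUP_LR = 1e-4

# Assumptions for illustration only; none of these come from the paper.
PEAK_LR = 5e-3         # some value inside the quoted search interval
STEPS_PER_EPOCH = 100  # hypothetical number of optimizer steps per epoch
MIN_LR = 0.0           # assumed final learning rate


def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * accumulation_steps


def learning_rate(step: int,
                  peak_lr: float = PEAK_LR,
                  warmup_lr: float = WARMUP_LR,
                  min_lr: float = MIN_LR,
                  steps_per_epoch: int = STEPS_PER_EPOCH) -> float:
    """Assumed linear warmup followed by cosine decay, evaluated per step."""
    warmup_steps = WARMUP_EPOCHS * steps_per_epoch
    total_steps = EPOCHS * steps_per_epoch
    if step < warmup_steps:
        # Assumed shape: ramp linearly from warmup_lr up to peak_lr.
        fraction = step / warmup_steps
        return warmup_lr + fraction * (peak_lr - warmup_lr)
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(BATCH_SIZE, ACCUMULATION_STEPS))
    for epoch in (0, 1, 25, 50):
        print("epoch", epoch, "lr", learning_rate(epoch * STEPS_PER_EPOCH))
```

The effective batch size is simply the product 128 × 16 = 2048, which matches the figure quoted in the row. The schedule shape is one common reading of "cosine learning rate schedule with a single warmup epoch"; the authors' repository may differ in how warmup and decay are parameterized.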