Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Certification of Uncertainty Calibration under Adversarial Attacks
Authors: Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz, Philip Torr, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that it is possible to produce adversaries that severely impact the reliability of confidence scores while leaving the accuracy unchanged... In Table 1, we show that all four possible configurations of our (η, ω)-ACE can be effective at significantly altering the ECE of PreActResNet18 ... on the validation set of CIFAR-10 ... and ImageNet-1K.... 5 Experiments We empirically evaluate the methods introduced above. |
| Researcher Affiliation | Academia | 1University of Oxford 2Vienna University of Technology |
| Pseudocode | Yes | Algorithm 2 Adversarial Calibration Training One Batch |
| Open Source Code | No | For further details please refer to the published code. |
| Open Datasets | Yes | CIFAR-10 (Krizhevsky, 2009) and ImageNet-1K (Deng et al., 2009), Fashion-MNIST (Xiao et al., 2017), Street View House Number (SVHN) dataset (Netzer et al., 2011), CIFAR-100 (Krizhevsky, 2009) |
| Dataset Splits | Yes | For ImageNet, we sample 500 images from the test set, following prior work. ...focus on a subset of 2000 certified samples for CIFAR-10. ...We certify 500 samples each on Fashion-MNIST, SVHN, and CIFAR-100, rather than the full test set as on CIFAR-10, due to the cost of randomized smoothing. |
| Hardware Specification | Yes | Our implementation utilises the torch.sparse package in version 2.0 (Paszke et al., 2019) and runs in less than 2 minutes for 7000 certified data points and 15 bins on a Nvidia RTX 3090. ...We mostly use A40 GPUs and equivalent older models. |
| Software Dependencies | Yes | Our implementation utilises the torch.sparse package in version 2.0 (Paszke et al., 2019) |
| Experiment Setup | Yes | We train using SGD with batch size 256 and weight decay of 0.0001. We use a learning rate of 2v ϵ T with factor v as additional hyperparameter. ...We fine-tune for 10 epochs with a linear warm-up schedule for ϵ that reaches full size at epoch 3. We decrease the learning rate for the model weights every 4 epochs by a factor of 0.1. ...all of the runs are performed on a batch size of 2048. |
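The Research Type and Hardware rows above quote the paper's use of binned ECE (Expected Calibration Error) with 15 bins. As a reference for readers unfamiliar with the metric, here is a minimal sketch of the standard binned ECE estimator — an illustration of the generic definition, not the paper's certified implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width
    confidence bins. `correct` holds 0/1 correctness indicators."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi]; a confidence of exactly 0 is never reached
        # in practice since the top-class softmax score is at least 1/K
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece
```

A perfectly calibrated predictor (bin accuracy equals bin confidence) yields an ECE of 0; the adversaries described in the paper inflate this gap without changing accuracy.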
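The Dataset Splits row notes that only 500–2000 samples are certified per dataset "due to the cost of randomized smoothing." That cost comes from Monte Carlo sampling: each certified prediction requires many forward passes under Gaussian noise. The sketch below illustrates the generic Cohen et al.-style smoothed prediction step under assumed names (`logit_fn`, `sigma`, `n_samples` are placeholders, not the paper's API):

```python
import numpy as np

def smoothed_predict(logit_fn, x, sigma=0.25, n_samples=1000, rng=None):
    """Illustrative Monte Carlo estimate of a smoothed classifier's top class:
    majority vote over Gaussian perturbations of the input. Each certified
    sample costs n_samples forward passes, which is why only a few hundred
    test points are typically certified."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        c = int(np.argmax(logit_fn(noisy)))
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)
```

Certification (as opposed to plain prediction) additionally needs a confidence interval on the vote counts, pushing sample counts into the tens of thousands per input.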