Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Certification of Uncertainty Calibration under Adversarial Attacks
Authors: Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz, Philip Torr, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that it is possible to produce adversaries that severely impact the reliability of confidence scores while leaving the accuracy unchanged... In Table 1, we show that all four possible configurations of our (η, ω)-ACE can be effective at significantly altering the ECE of PreActResNet18 ... on the validation set of CIFAR-10 ... and ImageNet-1K.... 5 Experiments We empirically evaluate the methods introduced above. |
| Researcher Affiliation | Academia | 1University of Oxford 2Vienna University of Technology |
| Pseudocode | Yes | Algorithm 2 Adversarial Calibration Training One Batch |
| Open Source Code | No | For further details please refer to the published code. |
| Open Datasets | Yes | CIFAR-10 (Krizhevsky, 2009) and ImageNet-1K (Deng et al., 2009), Fashion-MNIST (Xiao et al., 2017), Street View House Number (SVHN) dataset (Netzer et al., 2011), CIFAR-100 (Krizhevsky, 2009) |
| Dataset Splits | Yes | For ImageNet, we sample 500 images from the test set, following prior work. ...focus on a subset of 2000 certified samples for CIFAR-10. ...We certify 500 samples each on Fashion-MNIST, SVHN, and CIFAR-100, rather than the full test set as on CIFAR-10, due to the cost of randomized smoothing. |
| Hardware Specification | Yes | Our implementation utilises the torch.sparse package in version 2.0 (Paszke et al., 2019) and runs in less than 2 minutes for 7000 certified data points and 15 bins on a Nvidia RTX 3090. ...We mostly use A40 GPUs and equivalent older models. |
| Software Dependencies | Yes | Our implementation utilises the torch.sparse package in version 2.0 (Paszke et al., 2019) |
| Experiment Setup | Yes | We train using SGD with batch size 256 and weight decay of 0.0001. We use a learning rate of 2v ϵ T with factor v as additional hyperparameter. ...We fine-tune for 10 epochs with a linear warm-up schedule for ϵ that reaches full size at epoch 3. We decrease the learning rate for the model weights every 4 epochs by a factor of 0.1. ...all of the runs are performed on a batch size of 2048. |
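The Research Type and Hardware rows above quote the paper's use of binned ECE (Expected Calibration Error) with 15 bins. As a reference for readers unfamiliar with the metric, here is a minimal sketch of the standard binned ECE estimator — an illustration of the generic definition, not the paper's certified implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width
    confidence bins. `correct` holds 0/1 correctness indicators."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi]; a confidence of exactly 0 is never reached
        # in practice since the top-class softmax score is at least 1/K
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece
```

A perfectly calibrated predictor (bin accuracy equals bin confidence) yields an ECE of 0; the adversaries described in the paper inflate this gap without changing accuracy.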
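The Dataset Splits row notes that only 500–2000 samples are certified per dataset "due to the cost of randomized smoothing." That cost comes from Monte Carlo sampling: each certified prediction requires many forward passes under Gaussian noise. The sketch below illustrates the generic Cohen et al.-style smoothed prediction step under assumed names (`logit_fn`, `sigma`, `n_samples` are placeholders, not the paper's API):

```python
import numpy as np

def smoothed_predict(logit_fn, x, sigma=0.25, n_samples=1000, rng=None):
    """Illustrative Monte Carlo estimate of a smoothed classifier's top class:
    majority vote over Gaussian perturbations of the input. Each certified
    sample costs n_samples forward passes, which is why only a few hundred
    test points are typically certified."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        c = int(np.argmax(logit_fn(noisy)))
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)
```

Certification (as opposed to plain prediction) additionally needs a confidence interval on the vote counts, pushing sample counts into the tens of thousands per input.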