An Empirical Study Into What Matters for Calibrating Vision-Language Models

Authors: Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies." and "Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios."
Researcher Affiliation | Academia | The Australian National University, Curtin University, and University of Obuda.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions that the models used are 'publicly accessible through OpenCLIP (Ilharco et al., 2021) and TIMM (Wightman, 2019)' and refers to other third-party libraries like Hugging Face and LAVIS. However, it does not provide an explicit statement or link to the source code for the methodology described in this paper.
Open Datasets | Yes | "We assay the uncertainty estimation of VLMs on three standard image classification benchmarks: ImageNet (Deng et al., 2009), CIFAR-10 (Krizhevsky et al., 2009) and DomainNet (Peng et al., 2019)", and Appendix A.1 lists ImageNet (Deng et al., 2009) (https://www.image-net.org/) and others.
Dataset Splits | Yes | "Following the protocol in (Gupta et al., 2021), we divide the validation set of ImageNet into two halves: one for the in-distribution (ID) test set, and the other for learning calibration methods." and "For CIFAR-10, its validation set is used for model calibration, and CIFAR-10.1, CIFAR-10.2 (Recht et al., 2018b) and CINIC (Darlow et al., 2018) are used for evaluation."
Hardware Specification | Yes | All experiments are run on one 3090 GPU and an AMD EPYC 7343 16-Core Processor CPU.
Software Dependencies | Yes | PyTorch version is 1.10.0+cu111 and timm version is 0.8.21dev0.
Experiment Setup | Yes | "For model calibration, we by default use temperature scaling (Guo et al., 2017) on calibration sets." and "Throughout the experiments, we estimate ECE using equal-mass binning and 15 bins." (a minimal sketch of both steps follows the table)
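
For concreteness, below is a minimal sketch of the two calibration steps named in the Experiment Setup row: fitting a single temperature on a calibration split (temperature scaling, Guo et al., 2017) and computing ECE with equal-mass binning over 15 bins. This is not the authors' released code; the function names (`fit_temperature`, `equal_mass_ece`) and the PyTorch-based implementation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def fit_temperature(logits, labels, lr=0.01, max_iter=100):
    """Fit a single temperature T on a calibration set by minimising the
    negative log-likelihood of softmax(logits / T).
    `logits` is (N, C), `labels` is (N,)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()


def equal_mass_ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-mass binning: samples are sorted
    by confidence and split into `n_bins` bins of (nearly) equal size."""
    confidences, predictions = probs.max(dim=1)
    accuracies = predictions.eq(labels).float()

    order = confidences.argsort()
    confidences, accuracies = confidences[order], accuracies[order]

    n = confidences.numel()
    edges = torch.linspace(0, n, n_bins + 1).long()
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i].item(), edges[i + 1].item()
        if hi > lo:
            gap = (confidences[lo:hi].mean() - accuracies[lo:hi].mean()).abs().item()
            ece += (hi - lo) / n * gap
    return ece


# Illustrative usage, following the split protocol in the Dataset Splits row:
# fit T on the calibration half, report ECE on the held-out ID test half.
# cal_logits, cal_labels, test_logits, test_labels are assumed tensors.
# T = fit_temperature(cal_logits, cal_labels)
# ece = equal_mass_ece(F.softmax(test_logits / T, dim=1), test_labels)
```

Under the paper's stated protocol, the temperature would be learned on the calibration half of the ImageNet validation set (or the CIFAR-10 validation set) and the equal-mass, 15-bin ECE reported on the held-out ID and OOD test sets.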