An Empirical Study Into What Matters for Calibrating Vision-Language Models
Authors: Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies." and "Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios." |
| Researcher Affiliation | Academia | The Australian National University; Curtin University; Óbuda University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions that models used are 'publicly accessible through OpenCLIP (Ilharco et al., 2021) and TIMM (Wightman, 2019)' and refers to other third-party libraries like Hugging Face and LAVIS. However, it does not provide an explicit statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | "We assay the uncertainty estimation of VLMs on three standard image classification benchmarks: ImageNet (Deng et al., 2009), CIFAR-10 (Krizhevsky et al., 2009) and DomainNet (Peng et al., 2019)", and Appendix A.1 lists ImageNet (Deng et al., 2009) (https://www.image-net.org/) among others. |
| Dataset Splits | Yes | "Following the protocol in (Gupta et al., 2021), we divide the validation set of ImageNet into two halves: one for the in-distribution (ID) test set, and the other for learning calibration methods." and "For CIFAR-10, its validation set is used for model calibration, and CIFAR-10.1, CIFAR-10.2 (Recht et al., 2018b) and CINIC (Darlow et al., 2018) are used for evaluation." |
| Hardware Specification | Yes | All experiments are run on a single NVIDIA RTX 3090 GPU and an AMD EPYC 7343 16-Core Processor CPU. |
| Software Dependencies | Yes | PyTorch version is 1.10.0+cu111 and timm version is 0.8.21dev0. |
| Experiment Setup | Yes | "For model calibration, we use temperature scaling (Guo et al., 2017) on the calibration sets by default." and "Throughout the experiments, we estimate ECE using equal-mass binning and 15 bins." (See the sketch below the table.) |
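
The experiment setup above pairs temperature scaling learned on a held-out calibration split with ECE computed using equal-mass binning and 15 bins. The sketch below illustrates that combination; it is not the authors' released code. The function names, the LBFGS optimiser, and the random placeholder tensors standing in for calibration/test logits are assumptions for illustration only.

```python
"""Hedged sketch: temperature scaling (Guo et al., 2017) on a calibration split,
then ECE with equal-mass binning (15 bins), as described in the paper's setup."""
import torch
import torch.nn.functional as F


def fit_temperature(logits, labels, max_iter=50):
    """Learn a single temperature T > 0 on the calibration split by minimising NLL."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()


def ece_equal_mass(confidences, correct, n_bins=15):
    """Expected Calibration Error with equal-mass bins (~N/n_bins samples per bin)."""
    order = torch.argsort(confidences)
    conf, corr = confidences[order], correct[order].float()
    bins = torch.chunk(torch.arange(len(conf)), n_bins)  # contiguous equal-size index chunks
    ece, n = 0.0, len(conf)
    for idx in bins:
        if len(idx) == 0:
            continue
        gap = (conf[idx].mean() - corr[idx].mean()).abs()  # |confidence - accuracy| in this bin
        ece += (len(idx) / n) * gap.item()
    return ece


# Usage: in the paper's protocol, logits_cal/labels_cal would come from the calibration
# half of the validation set and logits_test/labels_test from the ID or OOD test set.
# Here they are random placeholders so the sketch runs standalone.
logits_cal, labels_cal = torch.randn(5000, 1000), torch.randint(0, 1000, (5000,))
logits_test, labels_test = torch.randn(5000, 1000), torch.randint(0, 1000, (5000,))

T = fit_temperature(logits_cal, labels_cal)
probs = F.softmax(logits_test / T, dim=1)
conf, pred = probs.max(dim=1)
print("temperature:", T, "ECE:", ece_equal_mass(conf, pred == labels_test))
```

Equal-mass binning is chosen here because, unlike equal-width bins, every bin contributes a comparable number of samples to the ECE estimate, which matches the paper's stated evaluation choice.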