An Empirical Study Into What Matters for Calibrating Vision-Language Models

Authors: Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies." and "Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios."
Researcher Affiliation | Academia | The Australian National University, Curtin University, and University of Obuda.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions that the models used are 'publicly accessible through OpenCLIP (Ilharco et al., 2021) and TIMM (Wightman, 2019)' and refers to other third-party libraries like Hugging Face and LAVIS. However, it does not provide an explicit statement or link to the source code for the methodology described in this paper.
Open Datasets | Yes | "We assay the uncertainty estimation of VLMs on three standard image classification benchmarks: ImageNet (Deng et al., 2009), CIFAR-10 (Krizhevsky et al., 2009) and DomainNet (Peng et al., 2019)", and Appendix A.1 lists ImageNet (Deng et al., 2009) (https://www.image-net.org/) and others.
Dataset Splits | Yes | "Following the protocol in (Gupta et al., 2021), we divide the validation set of ImageNet into two halves: one for the in-distribution (ID) test set, and the other for learning calibration methods." and "For CIFAR-10, its validation set is used for model calibration, and CIFAR-10.1, CIFAR-10.2 (Recht et al., 2018b) and CINIC (Darlow et al., 2018) are used for evaluation."
Hardware Specification | Yes | All experiments are run on one 3090 GPU and an AMD EPYC 7343 16-Core Processor CPU.
Software Dependencies | Yes | PyTorch version is 1.10.0+cu111 and timm version is 0.8.21dev0.
Experiment Setup | Yes | "For model calibration, we by default use temperature scaling (Guo et al., 2017) on calibration sets." and "Throughout the experiments, we estimate ECE using equal-mass binning and 15 bins." (a minimal sketch of both steps follows the table)
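
For concreteness, below is a minimal sketch of the two calibration steps named in the Experiment Setup row: fitting a single temperature on a calibration split (temperature scaling, Guo et al., 2017) and computing ECE with equal-mass binning over 15 bins. This is not the authors' released code; the function names (`fit_temperature`, `equal_mass_ece`) and the PyTorch-based implementation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def fit_temperature(logits, labels, lr=0.01, max_iter=100):
    """Fit a single temperature T on a calibration set by minimising the
    negative log-likelihood of softmax(logits / T).
    `logits` is (N, C), `labels` is (N,)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()


def equal_mass_ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-mass binning: samples are sorted
    by confidence and split into `n_bins` bins of (nearly) equal size."""
    confidences, predictions = probs.max(dim=1)
    accuracies = predictions.eq(labels).float()

    order = confidences.argsort()
    confidences, accuracies = confidences[order], accuracies[order]

    n = confidences.numel()
    edges = torch.linspace(0, n, n_bins + 1).long()
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i].item(), edges[i + 1].item()
        if hi > lo:
            gap = (confidences[lo:hi].mean() - accuracies[lo:hi].mean()).abs().item()
            ece += (hi - lo) / n * gap
    return ece


# Illustrative usage, following the split protocol in the Dataset Splits row:
# fit T on the calibration half, report ECE on the held-out ID test half.
# cal_logits, cal_labels, test_logits, test_labels are assumed tensors.
# T = fit_temperature(cal_logits, cal_labels)
# ece = equal_mass_ece(F.softmax(test_logits / T, dim=1), test_labels)
```

Under the paper's stated protocol, the temperature would be learned on the calibration half of the ImageNet validation set (or the CIFAR-10 validation set) and the equal-mass, 15-bin ECE reported on the held-out ID and OOD test sets.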