A Benchmark Study on Calibration

Authors: Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Chang Xu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive architecture space for a thorough exploration of calibration properties. We specifically create a model calibration dataset that evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis uses this dataset to answer several longstanding questions in the field. (A minimal sketch of one such bin-based metric appears after this table.)
Researcher Affiliation | Academia | Linwei Tao (University of Sydney, linwei.tao@sydney.edu.au); Younan Zhu and Haolan Guo (University of Sydney, {yzhu0986, hguo4658}@uni.sydney.edu.au); Minjing Dong (City University of Hong Kong, minjdong@cityu.edu.hk); Chang Xu (University of Sydney, c.xu@sydney.edu.au)
Pseudocode | No | The paper describes its experimental procedures and analyses in detail, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The project page can be found at https://www.taolinwei.com/calibration-study.
Open Datasets | Yes | Each unique architecture is pretrained for 200 epochs on three benchmark datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet16-120 (Chrabaszcz et al., 2017).
Dataset Splits | Yes | To evaluate post-temperature-scaling performance, we create a validation set by splitting the original test set 20%/80% into validation/test. (See the split sketch after this table.)
Hardware Specification | No | The paper discusses the computational cost of training models and states training durations (e.g., "pretrained for 200 epochs"), but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper describes the datasets and experimental setup, but it does not list specific software dependencies (e.g., programming languages, libraries, or frameworks) with version numbers.
Experiment Setup | Yes | Each unique architecture is pretrained for 200 epochs on three benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet16-120... We evaluate these metrics across a wide range of bin sizes, including 5, 10, 15, 20, 25, 50, 100, 200, and 500 bins. These metrics are assessed both before and after temperature scaling... Each transformer is fine-tuned for 60 epochs on CIFAR-10, CIFAR-100, and ImageNet16-120 from weights pretrained on ImageNet-1k. (A temperature-scaling sketch follows the table.)
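
The bin-based measurements referenced in the Research Type row discretize model confidence into bins and compare per-bin accuracy against per-bin confidence. As a point of reference, below is a minimal NumPy sketch of one such metric, the equal-width expected calibration error (ECE), swept over the bin counts the paper reports. The synthetic data and function name are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins):
    """Equal-width binned ECE: bin-weighted mean of |accuracy - confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy data: confidences and correctness indicators for 10,000 predictions.
rng = np.random.default_rng(0)
confidences = rng.uniform(0.5, 1.0, size=10_000)
correct = (rng.uniform(size=10_000) < confidences).astype(float)

# Sweep the bin counts used in the study (5 to 500 bins).
for n_bins in (5, 10, 15, 20, 25, 50, 100, 200, 500):
    ece = expected_calibration_error(confidences, correct, n_bins)
    print(f"{n_bins:>3d} bins: ECE = {ece:.4f}")
```

Running the sweep illustrates why the paper evaluates many bin counts: the measured ECE shifts with the binning resolution even when the underlying predictions are unchanged.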
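
For the Dataset Splits row, a minimal PyTorch sketch of the 20%/80% validation/test split, using a torchvision CIFAR-10 test set as a stand-in; the fixed seed and variable names are our assumptions, as the paper does not state them.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Stand-in test set; the study applies the same 20%/80% split to the
# CIFAR-10, CIFAR-100, and ImageNet16-120 test sets.
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())

n_val = int(0.2 * len(test_set))              # 20% held out for fitting the temperature
generator = torch.Generator().manual_seed(0)  # seed is an assumption, not stated in the paper
val_set, eval_set = random_split(test_set, [n_val, len(test_set) - n_val],
                                 generator=generator)
print(len(val_set), len(eval_set))            # 2000 8000 for CIFAR-10
```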
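
For the before/after temperature-scaling evaluation in the Experiment Setup row, a minimal PyTorch sketch of the standard procedure: fit a single scalar temperature T on validation logits by minimizing negative log-likelihood, then rescale test logits by T. The toy data, function name, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a scalar temperature T on validation logits by minimizing NLL.
    Optimizes log T so that T stays positive."""
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Toy check: overconfident logits should yield a fitted T > 1.
torch.manual_seed(0)
labels = torch.randint(0, 10, (1000,))
logits = 5.0 * torch.randn(1000, 10)       # sharply peaked, poorly calibrated logits
logits[torch.arange(1000), labels] += 2.0  # give the true class a slight edge
T = fit_temperature(logits, labels)
probs = F.softmax(logits / T, dim=1)       # calibrated probabilities for evaluation
print(f"fitted temperature: {T:.2f}")
```

In the benchmark's setting, the validation split from the sketch above would supply `val_logits` and `val_labels`, and the bin-based metrics would then be recomputed on the rescaled test probabilities.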