A Benchmark Study on Calibration
Authors: Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Chang Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study leverages the Neural Architecture Search (NAS) search space, which offers an exhaustive space of model architectures for a thorough exploration of calibration properties. We specifically create a model calibration dataset that evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field using our proposed dataset. (A sketch of a binned calibration metric follows the table.) |
| Researcher Affiliation | Academia | Linwei Tao, University of Sydney (linwei.tao@sydney.edu.au); Younan Zhu and Haolan Guo, University of Sydney ({yzhu0986, hguo4658}@uni.sydney.edu.au); Minjing Dong, City University of Hong Kong (minjdong@cityu.edu.hk); Chang Xu, University of Sydney (c.xu@sydney.edu.au) |
| Pseudocode | No | The paper describes its experimental procedures and analyses in detail, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project page can be found at https://www.taolinwei.com/calibration-study. |
| Open Datasets | Yes | Each unique architecture is pretrained for 200 epochs on three benchmark datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet16-120 (Chrabaszcz et al., 2017). |
| Dataset Splits | Yes | To evaluate post-temperature-scaling performance, we create a validation set by splitting the original test set 20%/80% into validation/test. (A temperature-scaling sketch follows the table.) |
| Hardware Specification | No | The paper discusses the computational costs of training models and states training durations (e.g., 'pretrained for 200 epochs'), but it does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper describes the datasets and experimental setup, but it does not list any specific software dependencies (e.g., programming languages, libraries, or frameworks) along with their version numbers. |
| Experiment Setup | Yes | Each unique architecture is pretrained for 200 epochs on three benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet16-120... We evaluate these metrics across a wide range of bin sizes, including 5, 10, 15, 20, 25, 50, 100, 200, and 500 bins. These metrics are assessed both before and after temperature scaling... Each transformer is fine-tuned for 60 epochs on CIFAR-10, CIFAR-100, and ImageNet16-120, based on the pretrained weights on ImageNet-1k. |
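
The bin-based measurements referenced above center on the expected calibration error (ECE). Below is a minimal NumPy sketch of equal-width binned ECE, swept over the bin counts listed in the Experiment Setup row. The function name and the synthetic data are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def expected_calibration_error(confidences, accuracies, n_bins=15):
    """Equal-width binned ECE: bin-weighted mean of |accuracy - confidence| gaps."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Assign each sample to the bin (lo, hi] by its top-class confidence.
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# Synthetic, near-perfectly-calibrated predictions (illustrative only).
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)               # top-class confidences
acc = (rng.uniform(size=10_000) < conf).astype(float)   # 1 if prediction correct

# Sweep the bin counts used in the study.
for n_bins in [5, 10, 15, 20, 25, 50, 100, 200, 500]:
    print(f"{n_bins:>3} bins: ECE = {expected_calibration_error(conf, acc, n_bins):.4f}")
```

Because the synthetic confidences match the true correctness probabilities, the reported ECE stays small at coarse binning and grows with bin count as per-bin estimates get noisier, which is exactly the bin-sensitivity the study measures.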
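For the 20%/80% validation/test split used to evaluate post-temperature-scaling metrics, the standard single-parameter recipe fits one temperature T on the validation logits by minimizing negative log-likelihood. The sketch below is a plausible PyTorch implementation under that assumption; the dummy logits, labels, and names are illustrative, not the authors' code.

```python
import torch

def fit_temperature(logits, labels, max_iter=50):
    """Fit a single temperature T on held-out logits by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    nll = torch.nn.CrossEntropyLoss()
    opt = torch.optim.LBFGS([log_t], lr=0.01, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Dummy stand-ins for a trained model's test-set outputs (illustrative only).
logits = torch.randn(10_000, 10)
labels = torch.randint(0, 10, (10_000,))

# Split the original test set 20%/80% into validation/test, as described above.
perm = torch.randperm(logits.shape[0])
val_idx, test_idx = perm[: logits.shape[0] // 5], perm[logits.shape[0] // 5 :]

T = fit_temperature(logits[val_idx], labels[val_idx])
calibrated_test_logits = logits[test_idx] / T   # compute post-scaling metrics on these
```

Optimizing log T rather than T directly keeps the temperature positive without explicit constraints, and fitting only on the 20% validation slice leaves the remaining 80% untouched for reporting calibration metrics.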