Proximity-Informed Calibration for Deep Neural Networks

Authors: Miao Xiong, Ailin Deng, Pang Wei Koh, Jiaying Wu, Shen Li, Jianqing Xu, Bryan Hooi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We examine the problem over 504 pretrained ImageNet models and observe that: ..."
Researcher Affiliation | Collaboration | "Miao Xiong1, Ailin Deng1, Pang Wei Koh2,3, Jiaying Wu1, Shen Li1, Jianqing Xu, Bryan Hooi1 (1 National University of Singapore; 2 University of Washington; 3 Google)"
Pseudocode | Yes | "We present the procedural steps of our approach in the form of pseudocode. Algorithm 1 encompasses the general inference phase: ... Algorithm 2 encapsulates the Density-Ratio Calibration algorithm."
Open Source Code | Yes | "Our code is available at: https://github.com/MiaoXiong2320/ProximityBias-Calibration.git"
Open Datasets | Yes | "We evaluate the effectiveness of our approach across large-scale datasets with three types of data characteristics (balanced, long-tail, and distribution-shifted) in the image and text domains: (1) datasets with a balanced class distribution (i.e., each class has an equal number of samples): the vision dataset ImageNet [7] and two text datasets, Yahoo Answers Topics [49] and MultiNLI-Match [44]; (2) datasets with a long-tail class distribution: two image datasets, iNaturalist 2021 [3] and ImageNet-LT [27]; (3) datasets with distribution shift: ImageNet-C [15], MultiNLI-Mismatch [44], and ImageNet-Sketch [42]."
Dataset Splits | Yes | "Specifically, we randomly split the hold-out dataset into a calibration set and an evaluation set: n_c = n_e = 25000 for ImageNet, n_c = n_e = 50000 for iNaturalist 2021, and n_c = n_e = 5000 for ImageNet-LT. For the Yahoo and MultiNLI-Match datasets, we sample 20% of the training data as the calibration set and use the original test set for evaluation." (A split sketch follows the table.)
Hardware Specification | Yes | "The result is reported in Table 3. Compared to the confidence baseline, our method Bin-Mean-Shift exhibits a slight increase of 1.17% in runtime, while Density-Ratio introduces a modest overhead of 12.3%. These results demonstrate that our method incurs minimal computational overhead while achieving runtime efficiency comparable to the other baseline methods. In addition, it is worth noting that the cost of computing proximity has been reduced by recent advances in neighborhood search algorithms. In our implementation, we employ IndexFlatL2 from Meta's open-sourced GPU-accelerated Faiss library [18] to find each sample's nearest neighbors. This algorithm reduces the nearest-neighbor search time to approximately 0.04 ms per sample (shown in Table 3). The computational overhead beyond the neighbor search is quite similar to that of isotonic regression (IR) and histogram binning (HB), which leads to a total time roughly twice that of isotonic regression (0.04 + 0.05 ≈ 0.1 s). ... on a single Nvidia RTX 2080 Ti." (A Faiss search sketch follows the table.)
Software Dependencies | No | The paper mentions the statsmodels and Faiss libraries but does not provide version numbers for these dependencies. It also refers to scikit-learn without a version.
Experiment Setup | Yes | "For nearest-neighbor computation, we use IndexFlatL2 from Faiss [18]. Except for the hyperparameter sensitivity experiments, we use K = 10 for the proximity computation. Regarding our method, for Density-Ratio, the kernel density estimation over the two variables is implemented with the statsmodels library [37]. For the Bin-Mean-Shift method, we set the regularization parameter λ = 0.5. For the calibration setup, we adopt a standard calibration setup [10] with a fixed-size calibration set (i.e., validation set) and evaluation test datasets. Specifically, we randomly split the hold-out dataset into a calibration set and an evaluation set: n_c = n_e = 25000 for ImageNet, n_c = n_e = 50000 for iNaturalist 2021, and n_c = n_e = 5000 for ImageNet-LT. For the Yahoo and MultiNLI-Match datasets, we sample 20% of the training data as the calibration set and use the original test set for evaluation. For the evaluation, we use random seeds 2020, 2021, 2022, 2023, and 2024 and report the mean (this does not apply to the NLP datasets, which have fixed test sets)." (A density-ratio sketch follows the table.)
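
To make the quoted setup concrete, the sketches below are in Python, matching the paper's released tooling. First, a minimal sketch of the hold-out split from the Dataset Splits and Experiment Setup rows (ImageNet case: n_c = n_e = 25000; seed 2020 is one of the five listed). The `logits` and `labels` arrays are hypothetical stand-ins for the model's hold-out predictions, not the authors' code.

```python
import numpy as np

# Hold-out split sketch (assumed variable names; sizes and seed from the setup).
rng = np.random.default_rng(seed=2020)      # one of the five listed seeds
n_c = n_e = 25000                           # ImageNet calibration/evaluation sizes

logits = rng.random((n_c + n_e, 1000))      # stand-in hold-out predictions
labels = rng.integers(0, 1000, size=n_c + n_e)

perm = rng.permutation(n_c + n_e)           # random split, as described
calib_idx, eval_idx = perm[:n_c], perm[n_c:]
calib_logits, calib_labels = logits[calib_idx], labels[calib_idx]
eval_logits, eval_labels = logits[eval_idx], labels[eval_idx]
```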
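
Next, a sketch of the nearest-neighbor step named in the Hardware Specification and Experiment Setup rows. `IndexFlatL2` is the real Faiss exact L2 index; the embeddings here are random stand-ins, and the closing exp(-mean distance) proximity formula is our assumption, not a quote from the paper.

```python
import faiss
import numpy as np

d = 512                                       # embedding width (assumed)
rng = np.random.default_rng(0)
calib_emb = rng.random((25000, d), dtype=np.float32)  # stand-in embeddings
test_emb = rng.random((1000, d), dtype=np.float32)

index = faiss.IndexFlatL2(d)                  # exact (brute-force) L2 index
index.add(calib_emb)                          # database: calibration embeddings

K = 10                                        # K = 10, per the setup above
dists, _ = index.search(test_emb, K)          # squared L2 distances, (1000, K)

# Assumed proximity definition: exp of the negative mean neighbor distance.
proximity = np.exp(-np.sqrt(dists).mean(axis=1))
```

On GPU, the same index can be moved over with faiss.StandardGpuResources() and faiss.index_cpu_to_gpu, which is how sub-millisecond per-sample search times like the quoted 0.04 ms become attainable.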
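
Finally, a hedged sketch of a density-ratio style recalibration consistent with the setup: two kernel density estimates over (confidence, proximity), fit with the statsmodels estimator the paper cites, combined via Bayes' rule. This is our reading of the idea, not the authors' exact Algorithm 2; all function and variable names are assumptions.

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

def fit_density_ratio(conf, prox, correct):
    """Fit 2-D KDEs over (confidence, proximity) on the calibration set."""
    pos = np.column_stack([conf[correct], prox[correct]])
    both = np.column_stack([conf, prox])
    kde_pos = KDEMultivariate(data=pos, var_type="cc")   # p(conf, prox | correct)
    kde_all = KDEMultivariate(data=both, var_type="cc")  # p(conf, prox)
    return kde_pos, kde_all, correct.mean()              # prior P(correct)

def calibrate(conf, prox, kde_pos, kde_all, prior):
    """Bayes' rule: P(correct | c, p) = p(c, p | correct) * P(correct) / p(c, p)."""
    pts = np.column_stack([conf, prox])
    num = kde_pos.pdf(pts) * prior
    den = kde_all.pdf(pts) + 1e-12                       # numerical guard
    return np.clip(num / den, 0.0, 1.0)
```

Applied to eval-set confidences and the proximities from the previous sketch, this returns recalibrated confidences in [0, 1]; note that KDE evaluation cost scales with the product of the calibration and evaluation set sizes, consistent with the modest Density-Ratio overhead quoted above.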