Decoupling Feature Extraction and Classification Layers for Calibrated Neural Networks

Authors: Mikkel Jordahn, Pablo M. Olmos

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate these methods improve calibration across ViT and WRN architectures for several image classification benchmark datasets. We run a number of experiments to verify the benefits of TST and V-TST. |
| Researcher Affiliation | Academia | Mikkel Jordahn (1), Pablo M. Olmos (2). (1) Cognitive Systems, Technical University of Denmark, Kongens Lyngby, Denmark. (2) Signal Processing Group (GTS), Universidad Carlos III de Madrid, Madrid, Spain. |
| Pseudocode | Yes | We show the algorithm for TST in Algorithm 1. Algorithm 1 TST. 1: Init. DNN M w. parameters {β, ϕ}. 2: Stage 1: Train M with CE loss on D_train until convergence or early stopped. 3: Freeze parameters β of M. 4: Re-init. FC layers of M w. parameters {θ, ν}. 5: Stage 2: Train {θ, ν} of M with CE loss on D_train until convergence. (A PyTorch sketch of this two-stage procedure is given after the table.) |
| Open Source Code | Yes | We provide the code for reproducing our main results at https://github.com/MJordahn/Decoupled-Layers-for-Calibrated-NNs. |
| Open Datasets | Yes | We demonstrate how these methods significantly improve calibration metrics on CIFAR10, CIFAR100 and SVHN and for different model architectures, in particular Wide Residual Networks (WRN) (Zagoruyko & Komodakis, 2016) and Vision Transformers (ViT) (Dosovitskiy et al., 2021). We fine-tune it for a classification task on Tiny ImageNet (Le & Yang, 2015). |
| Dataset Splits | Yes | We compute the validation loss based on a validation set, which is data we split from the training set. For CIFAR10 and SVHN we use 15% of the training set for validation, whilst for CIFAR100 we use only 5% of the data for validation. (A data-loading sketch reproducing these splits is given after the table.) |
| Hardware Specification | No | The paper discusses training costs and speeds but does not provide specific hardware details (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and references specific model implementations like 'WRN 28-10 as it is specified in Zagoruyko and Komodakis (2016)' and 'The ViT model is based on the implementation in Dosovitskiy et al. (2021)', but does not specify software versions (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We always train with Adam optimizer (Kingma & Ba, 2014). We train it for 600 epochs using Adam optimizer with learning rate 10^-4, but employ early stopping based on the validation loss. We use a patch size of 4, a token dim of size 512, depth of size 6, 8 heads, MLP dim of size 512 and head dimension of size 64. We use dropout in both the Transformer and the embeddings with p = 0.1. We only train for 40 additional epochs, but in most of our experiments, much less are required to converge. (A configuration sketch with these hyperparameters is given after the table.) |
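
The two-stage procedure in the Pseudocode row maps naturally onto a standard PyTorch training loop. The sketch below is a minimal illustration, assuming a model that exposes its backbone as `model.features` (parameters β in the paper's notation) and its fully connected head as `model.classifier`; these attribute names, the loop structure, and the omission of early stopping are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def two_stage_training(model, train_loader, device="cuda",
                       stage1_epochs=600, stage2_epochs=40, lr=1e-4):
    """Minimal sketch of TST: train end-to-end, then freeze the feature
    extractor and retrain a re-initialised classification head.

    Assumes `model.features` is the backbone and `model.classifier` is the
    FC head; these names are illustrative, not from the authors' repository.
    """
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()

    # Stage 1: train the full network with cross-entropy
    # (early stopping on the validation loss is omitted for brevity).
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(stage1_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()

    # Freeze the feature extractor (parameters beta).
    for p in model.features.parameters():
        p.requires_grad = False

    # Re-initialise the fully connected layers of the head.
    for m in model.classifier.modules():
        if isinstance(m, nn.Linear):
            m.reset_parameters()

    # Stage 2: train only the classification head with cross-entropy.
    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    for _ in range(stage2_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()

    return model
```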
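The dataset splits quoted above can be reproduced with standard torchvision datasets. The sketch below shows one way to carve out a 15% (CIFAR10, SVHN) or 5% (CIFAR100) validation split; the transform, seed, and function name are placeholders, not taken from the released code.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

def load_train_val(name="CIFAR10", root="./data", seed=0):
    """Split a torchvision training set into train/validation subsets."""
    # 15% validation for CIFAR10/SVHN, 5% for CIFAR100, as stated in the paper.
    val_fraction = 0.05 if name == "CIFAR100" else 0.15
    tfm = transforms.ToTensor()  # placeholder transform

    if name == "CIFAR10":
        full = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    elif name == "CIFAR100":
        full = datasets.CIFAR100(root, train=True, download=True, transform=tfm)
    elif name == "SVHN":
        full = datasets.SVHN(root, split="train", download=True, transform=tfm)
    else:
        raise ValueError(f"Unknown dataset: {name}")

    n_val = int(len(full) * val_fraction)
    n_train = len(full) - n_val
    generator = torch.Generator().manual_seed(seed)
    return random_split(full, [n_train, n_val], generator=generator)
```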
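The ViT hyperparameters listed in the Experiment Setup row can be gathered into a single configuration. The sketch below instantiates them with the third-party `vit_pytorch` package purely for illustration; the paper only states that its ViT follows Dosovitskiy et al. (2021), so the library choice, image size, and class count are assumptions.

```python
import torch
from vit_pytorch import ViT  # lucidrains' vit-pytorch, used here only as an example

# Hyperparameters quoted from the paper's experiment setup.
vit = ViT(
    image_size=32,    # CIFAR-sized inputs (assumption)
    patch_size=4,     # patch size of 4
    num_classes=10,   # e.g. CIFAR10 (assumption)
    dim=512,          # token dim of size 512
    depth=6,          # depth of size 6
    heads=8,          # 8 attention heads
    mlp_dim=512,      # MLP dim of size 512
    dim_head=64,      # head dimension of size 64
    dropout=0.1,      # dropout in the Transformer
    emb_dropout=0.1,  # dropout in the embeddings
)

optimizer = torch.optim.Adam(vit.parameters(), lr=1e-4)  # Adam with learning rate 10^-4
```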