Decoupling Feature Extraction and Classification Layers for Calibrated Neural Networks
Authors: Mikkel Jordahn, Pablo M. Olmos
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate these methods improve calibration across ViT and WRN architectures for several image classification benchmark datasets. We run a number of experiments to verify the benefits of TST and V-TST. |
| Researcher Affiliation | Academia | Mikkel Jordahn (Cognitive Systems, Technical University of Denmark, Kongens Lyngby, Denmark); Pablo M. Olmos (Signal Processing Group (GTS), Universidad Carlos III de Madrid, Madrid, Spain). |
| Pseudocode | Yes | We show the algorithm for TST in Algorithm 1. Algorithm 1 (TST): 1: Init. DNN M with parameters {β, ϕ}. 2: Stage 1: Train M with CE loss on D_train until convergence or early stopping. 3: Freeze parameters β of M. 4: Re-init. FC layers of M with parameters {θ, ν}. 5: Stage 2: Train {θ, ν} of M with CE loss on D_train until convergence. (A runnable sketch of this procedure is given below the table.) |
| Open Source Code | Yes | We provide the code for reproducing our main results at https://github.com/MJordahn/Decoupled-Layers-for-Calibrated-NNs. |
| Open Datasets | Yes | We demonstrate how these methods significantly improve calibration metrics on CIFAR10, CIFAR100 and SVHN and for different model architectures, in particular Wide Residual Networks (WRN) (Zagoruyko & Komodakis, 2016) and Vision Transformers (ViT) (Dosovitskiy et al., 2021). We fine-tune it for a classification task on Tiny ImageNet (Le & Yang, 2015). |
| Dataset Splits | Yes | We compute the validation loss based on a validation set, which is data we split from the training set. For CIFAR10 and SVHN we use 15% of the training set for validation, whilst for CIFAR100 we use only 5% of the data for validation. (A split sketch is shown below the table.) |
| Hardware Specification | No | The paper discusses training costs and speeds but does not provide specific hardware details (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and references specific model implementations like 'WRN 28-10 as it is specified in Zagoruyko and Komodakis (2016)' and 'The ViT model is based on the implementation in Dosovitskiy et al. (2021)', but does not specify software versions (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We always train with Adam optimizer (Kingma & Ba, 2014). We train it for 600 epochs using Adam optimizer with learning rate 10^-4, but employ early stopping based on the validation loss. We use a patch size of 4, a token dim of size 512, depth of size 6, 8 heads, MLP dim of size 512 and head dimension of size 64. We use dropout in both the Transformer and the embeddings with p = 0.1. We only train for 40 additional epochs, but in most of our experiments far fewer are required to converge. (A configuration sketch follows the table.) |
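
The two-stage procedure in Algorithm 1 maps directly onto a standard training loop. Below is a minimal sketch, assuming PyTorch (the paper does not name its framework); the class `TwoStageNet`, the helper `_make_head`, and the data `loader` are hypothetical stand-ins for the authors' code. It trains the full network with cross-entropy, freezes the feature-extraction parameters, re-initializes the fully-connected head, and retrains only the head. Early stopping on the validation loss, mentioned in the setup row, is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical model split: a feature extractor (parameters beta)
# followed by re-initialisable FC layers (parameters theta, nu).
class TwoStageNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                      # feature-extraction layers (beta)
        self.head = self._make_head(feat_dim, n_classes)

    @staticmethod
    def _make_head(feat_dim: int, n_classes: int) -> nn.Module:
        # FC classification layers (theta, nu); re-created for Stage 2.
        return nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                             nn.Linear(feat_dim, n_classes))

    def forward(self, x):
        return self.head(self.backbone(x))


def train(params, model, loader, epochs, lr=1e-4):
    """Generic cross-entropy training loop over the given parameter group."""
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()


def tst(model: TwoStageNet, loader, feat_dim, n_classes,
        stage1_epochs=600, stage2_epochs=40):
    # Stage 1: train the whole network with CE loss.
    train(model.parameters(), model, loader, stage1_epochs)

    # Freeze the feature-extraction parameters (beta).
    for p in model.backbone.parameters():
        p.requires_grad = False

    # Re-initialise the FC layers (theta, nu) and train only those (Stage 2).
    model.head = TwoStageNet._make_head(feat_dim, n_classes)
    train(model.head.parameters(), model, loader, stage2_epochs)
    return model
```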
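The validation splits described in the Dataset Splits row (15% of the training set for CIFAR10 and SVHN, 5% for CIFAR100) can be reproduced with a simple random split. The sketch below assumes torchvision and `torch.utils.data.random_split`; this is an assumption about tooling, not the authors' exact code, and the fixed seed is illustrative.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Validation fractions quoted in the paper excerpt.
VAL_FRACTION = {"cifar10": 0.15, "svhn": 0.15, "cifar100": 0.05}

def load_train_val(name: str, root: str = "./data"):
    """Split a validation set off the official training set."""
    tfm = transforms.ToTensor()
    if name == "cifar10":
        full = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    elif name == "cifar100":
        full = datasets.CIFAR100(root, train=True, download=True, transform=tfm)
    elif name == "svhn":
        full = datasets.SVHN(root, split="train", download=True, transform=tfm)
    else:
        raise ValueError(f"unknown dataset: {name}")

    n_val = int(len(full) * VAL_FRACTION[name])
    n_train = len(full) - n_val
    generator = torch.Generator().manual_seed(0)   # fixed seed for a reproducible split
    return random_split(full, [n_train, n_val], generator=generator)

train_set, val_set = load_train_val("cifar100")    # 95% / 5% split
```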
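The ViT hyperparameters in the Experiment Setup row (patch size 4, token dim 512, depth 6, 8 heads, MLP dim 512, head dimension 64, dropout 0.1 in both the Transformer and the embeddings) can be expressed with the `vit-pytorch` package, whose constructor exposes matching arguments; that package choice, the 32x32 image size, and the 100-class output are assumptions, not details confirmed by the excerpt.

```python
import torch
from vit_pytorch import ViT   # assumption: the lucidrains vit-pytorch package

# ViT configuration from the experiment-setup row; image_size=32 (CIFAR-sized
# inputs) and num_classes=100 are illustrative assumptions.
model = ViT(
    image_size=32,
    patch_size=4,       # patch size of 4
    num_classes=100,
    dim=512,            # token dim
    depth=6,            # Transformer depth
    heads=8,            # attention heads
    mlp_dim=512,        # MLP dim
    dim_head=64,        # head dimension
    dropout=0.1,        # dropout inside the Transformer
    emb_dropout=0.1,    # dropout on the embeddings
)

# Stage-1 optimiser as described: Adam with learning rate 1e-4, trained for up
# to 600 epochs with early stopping on the validation loss (loop omitted).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```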