Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Tailoring Mixup to Data for Calibration

Authors: Quentin Bouniot, Pavlo Mozharovskyi, Florence d'Alché-Buc

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide extensive experiments for classification and regression tasks, showing that our proposed method improves predictive performance and calibration of models, while being much more efficient. We quantitatively ascertain the effectiveness of our Similarity Kernel Mixup with extensive experiments on multiple datasets, from image classification to regression tasks, and multiple deep neural network architectures."
Researcher Affiliation | Academia | "¹LTCI, Télécom Paris, Institut Polytechnique de Paris, France; ²Technical University of Munich; ³Helmholtz Munich; ⁴Munich Center for Machine Learning (MCML)"
Pseudocode | Yes | "We present a pseudocode of our Similarity Kernel Mixup procedure for a single training iteration in Algorithm 1. The generation of new data is explained in the pseudocode as a sequential process for simplicity and ease of understanding, but the actual implementation is optimized to work in parallel on GPU through vectorized operations."
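Algorithm 1 itself is not reproduced in this report. For orientation only, the following is a minimal pure-Python sketch of a similarity-modulated mixup step, assuming a Gaussian similarity kernel with the τmax and τstd parameters cited elsewhere in the report; the function names and the exact λ-warping scheme are our own illustrative choices, not the authors' implementation.

```python
import math
import random

def gaussian_similarity(x1, x2, tau_max=1.0, tau_std=0.4):
    """Hypothetical Gaussian kernel on input distance: identical pairs
    score tau_max, distant pairs decay toward 0."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return tau_max * math.exp(-dist2 / (2 * tau_std ** 2))

def sim_kernel_mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """One mixup step whose interpolation weight depends on pair similarity.

    lam is drawn from Beta(alpha, alpha). The kernel value k keeps the
    sampled lam for similar pairs (full mixing) and pushes lam toward 1
    for dissimilar pairs (the pair is left almost unmixed). This warping
    is illustrative, not the paper's exact rule.
    """
    lam = rng.betavariate(alpha, alpha)
    k = gaussian_similarity(x1, x2)
    lam_warped = k * lam + (1.0 - k)  # k=0 gives lam=1, i.e. no mixing
    x_mix = [lam_warped * a + (1 - lam_warped) * b for a, b in zip(x1, x2)]
    y_mix = lam_warped * y1 + (1 - lam_warped) * y2
    return x_mix, y_mix, lam_warped
```

The vectorized GPU version the paper describes would compute the kernel and the warped coefficients for a whole batch at once rather than per pair.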
Open Source Code | Yes | "Code is available at https://github.com/qbouniot/sim_kernel_mixup"
Open Datasets | Yes | "Image Classification: We follow experimental settings from previous works (Liu et al., 2022a; Pinto et al., 2022; Wang et al., 2023; Noh et al., 2023) and evaluate our approach on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Tiny-ImageNet (Deng et al., 2009) and ImageNet (Russakovsky et al., 2015) datasets for In-Distribution (ID) performance and calibration."
Dataset Splits | Yes | "We selected the values giving the best trade-off between accuracy and calibration using cross-validation, with a stratified sampling on a 90/10 split of the training set, similarly to Pinto et al. (2022), and average the results across 4 different splits."
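For reference, a 90/10 stratified split of the kind described can be sketched in plain Python; in practice scikit-learn's StratifiedShuffleSplit would usually be used, and the function below is our own illustrative version, not code from the paper.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, seed=0):
    """Split indices into train/val, preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, val = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_frac))  # at least 1 per class
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return train, val
```

Repeating this with different seeds and averaging results corresponds to the "4 different splits" mentioned in the excerpt.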
Hardware Specification | Yes | "We present a comparison in Table 8, in terms of performance (accuracy and calibration) and computation time, of the three different variants of implementation discussed when training a ResNet34 on CIFAR10 on a single A100 GPU (τmax = 1, τstd = 0.4)."
Software Dependencies | No | The paper mentions software such as SGD, Adam, Dropout, and PyTorch (implicitly via torch.distributions.Beta), but does not provide specific version numbers for any of these.
Experiment Setup | Yes | "On CIFAR10 and CIFAR100, we use SGD as the optimizer with a momentum of 0.9 and weight decay of 10⁻⁴, a batch size of 128, and the standard augmentations (random crop, horizontal flip, and normalization). Models are trained for 200 epochs, with an initial learning rate of 0.1 divided by a factor of 10 after 80 and 120 epochs."
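The learning-rate schedule in that excerpt (0.1, divided by 10 after epochs 80 and 120) is a standard step decay, matching the semantics of PyTorch's MultiStepLR. A dependency-free sketch, with our own function name:

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 120), gamma=0.1):
    """Step decay: multiply the base rate by gamma at each milestone
    the current epoch has already passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

So epochs 0-79 train at 0.1, epochs 80-119 at 0.01, and epochs 120-199 at 0.001.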