Trainable Calibration Measures for Neural Networks from Kernel Mean Embeddings
Authors: Aviral Kumar, Sunita Sarawagi, Ujjwal Jain
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several network architectures demonstrate that MMCE is a fast, stable, and accurate method to minimize calibration error metrics while maximally preserving the number of high confidence predictions. |
| Researcher Affiliation | Academia | Aviral Kumar, Sunita Sarawagi, Ujjwal Jain (Department of Computer Science and Engineering, IIT Bombay, Mumbai, India). |
| Pseudocode | No | The paper describes mathematical formulations and processes (e.g., in Section 3 A Trainable Calibration Measure from Kernel Embeddings and Section 3.1 Minimizing MMCE during training), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is partially available and will be made fully available at https://github.com/aviralkumar2907/MMCE. |
| Open Datasets | Yes | The datasets used in our experiments are: 1. CIFAR-10 (Krizhevsky et al.): Color images (32×32) from 10 classes; 45,000/5,000/10,000 images for train/validation/test. 2. CIFAR-100 (Krizhevsky et al.): Same as above but with 100 classes. 3. Caltech Birds 200 (Welinder et al., 2010): Images of 200 bird species drawn from ImageNet; 5994/2897/2897 images for train/validation/test. 4. 20 Newsgroups: News articles partitioned into 20 categories by content; 15098/900/3999 documents for train/validation/test. 5. IMDB reviews (Maas et al., 2011): Polar movie reviews for sentiment classification; 25000/5000/20000 for train/validation/test. 6. UC Irvine Human Activity Recognition (HAR) (Anguita et al., 2013): Time series from phones corresponding to 6 human actions; 6653/699/2947 instances for train/validation/test. 7. Stanford Sentiment Treebank (SST) (Socher et al., 2012): Movie reviews represented as parse trees annotated by sentiment; each sample includes a binary label and a fine-grained 5-class label, and we used the binary version; 6920/872/1821 documents for train/validation/test. |
| Dataset Splits | Yes | CIFAR-10 (Krizhevsky et al.): Color images (32×32) from 10 classes; 45,000/5,000/10,000 images for train/validation/test. |
| Hardware Specification | Yes | We compared the running times per epoch of the baseline and the MMCE-trained model (on an NVIDIA Titan X GPU). |
| Software Dependencies | No | The paper refers to several publicly available models and codebases (e.g., TensorFlow, 2018; Keras Team, 2018) but does not explicitly list the specific versions of these, or of any other software dependencies (such as Python, PyTorch, or CUDA), required to replicate the experiments. |
| Experiment Setup | Yes | The weight λ for MMCE relative to NLL is chosen via cross-validation. The same kernel k(r, r′) was used for all datasets since r, r′ are probabilities. We chose the Laplacian kernel k(r, r′) = exp(−\|r − r′\| / 0.4), a universal kernel, with a width of 0.4. For measuring calibration error we use ECE with 20 bins, each of size 0.05. We use a batch size of 128, except when the default batch size in the codebase was higher; for example, in the IMDB HAN codebase the default batch size was 256, and for UCI HAR it was 1500. Other details about the optimizer and hyperparameters were kept unchanged from the base models in the downloaded source. As an exception, the batch size for SST was kept fixed at 25 (the default value). Minimal sketches of the kernel-based MMCE penalty and of ECE with 20 bins appear below the table. |
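
To show the Laplacian kernel in context, here is a minimal NumPy sketch of a batch-level squared-MMCE penalty of the kind described in the setup row above. The estimator (a kernel-weighted sum of calibration residuals c_i − r_i) reflects our reading of the paper; the function names and this NumPy implementation are ours, not the authors' released code at the repository linked above.

```python
import numpy as np

def laplacian_kernel(r1, r2, width=0.4):
    # Universal Laplacian kernel k(r, r') = exp(-|r - r'| / width); width 0.4 as in the paper
    return np.exp(-np.abs(r1 - r2) / width)

def mmce_squared(confidences, correct, width=0.4):
    """Batch-level squared MMCE estimate (our reading of the paper's estimator).

    confidences: max softmax probabilities r_i in [0, 1]
    correct:     0/1 indicators c_i (1 if the prediction matched the label)
    """
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    m = r.shape[0]
    residual = c - r                                      # per-sample calibration residual
    K = laplacian_kernel(r[:, None], r[None, :], width)   # m x m kernel matrix
    # (1/m^2) * sum_ij (c_i - r_i)(c_j - r_j) k(r_i, r_j)
    return float(residual @ K @ residual) / (m * m)
```

During training, the paper adds λ times a penalty of this form, computed per mini-batch and differentiable in the confidences, to the NLL loss, with λ chosen by cross-validation as quoted above.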
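
The evaluation metric quoted in the setup row, ECE with 20 equal-width bins of size 0.05, can be computed as in the following sketch. This is the standard ECE definition; the helper name is ours.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=20):
    """ECE with n_bins equal-width confidence bins (size 0.05 for 20 bins)."""
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # include the left edge only for the first bin so every sample falls in exactly one bin
        in_bin = (r > lo) & (r <= hi) if i > 0 else (r >= lo) & (r <= hi)
        if not in_bin.any():
            continue
        gap = abs(c[in_bin].mean() - r[in_bin].mean())  # |accuracy - avg confidence| in the bin
        ece += in_bin.mean() * gap                      # weight by the fraction of samples in the bin
    return ece
```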