Trainable Calibration Measures for Neural Networks from Kernel Mean Embeddings
Authors: Aviral Kumar, Sunita Sarawagi, Ujjwal Jain
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several network architectures demonstrate that MMCE is a fast, stable, and accurate method to minimize calibration error metrics while maximally preserving the number of high confidence predictions. |
| Researcher Affiliation | Academia | Aviral Kumar, Sunita Sarawagi, Ujjwal Jain (Department of Computer Science and Engineering, IIT Bombay, Mumbai, India). |
| Pseudocode | No | The paper describes mathematical formulations and processes (e.g., in Section 3 A Trainable Calibration Measure from Kernel Embeddings and Section 3.1 Minimizing MMCE during training), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is partially available and will be made fully available at https://github.com/aviralkumar2907/MMCE. |
| Open Datasets | Yes | The datasets used in our experiments are: 1. CIFAR-10 (Krizhevsky et al.): Color images (32×32) from 10 classes; 45,000/5,000/10,000 images for train/validation/test. 2. CIFAR-100 (Krizhevsky et al.): Same as above but with 100 classes. 3. Caltech Birds 200 (Welinder et al., 2010): Images of 200 bird species drawn from ImageNet; 5994/2897/2897 images for train/validation/test. 4. 20 Newsgroups: News articles partitioned into 20 categories by content; 15098/900/3999 documents for train/validation/test. 5. IMDB reviews (Maas et al., 2011): Polar movie reviews for sentiment classification; 25000/5000/20000 for train/validation/test. 6. UC Irvine Human Activity Recognition (HAR) (Anguita et al., 2013): Time series from phones corresponding to 6 human actions; 6653/699/2947 instances for train/validation/test. 7. Stanford Sentiment Treebank (SST) (Socher et al., 2012): Movie reviews represented as parse trees annotated by sentiment; each sample includes a binary label and a fine-grained 5-class label, and we used the binary version; 6920/872/1821 documents for train/validation/test. |
| Dataset Splits | Yes | CIFAR-10 (Krizhevsky et al.): Color images (32×32) from 10 classes; 45,000/5,000/10,000 images for train/validation/test. |
| Hardware Specification | Yes | We compared the running times per epoch of the baseline and the MMCE-trained model (on an NVIDIA Titan X GPU). |
| Software Dependencies | No | The paper refers to several publicly available models and codebases (e.g., TensorFlow, 2018; Keras Team, 2018) but does not explicitly list the specific versions of these, or of any other software dependencies (such as Python, PyTorch, or CUDA), required to replicate the experiments. |
| Experiment Setup | Yes | The weight λ for MMCE relative to NLL is chosen via cross-validation. The same kernel k(r, r′) was used for all datasets since r, r′ are probabilities. We chose the Laplacian kernel k(r, r′) = exp(−\|r − r′\| / 0.4), a universal kernel, with a width of 0.4. For measuring calibration error we use ECE with 20 bins, each of size 0.05. We use a batch size of 128, except when the default batch size in the codebase was higher; for example, in the IMDB HAN codebase the default batch size was 256, and for UCI HAR it was 1500. Other details about the optimizer and hyperparameters were kept unchanged from the base models in the downloaded source. As an exception, the batch size for SST was kept fixed at 25 (the default value). Minimal sketches of the kernel-based MMCE penalty and of ECE with 20 bins appear below the table. |
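
To show the Laplacian kernel in context, here is a minimal NumPy sketch of a batch-level squared-MMCE penalty of the kind described in the setup row above. The estimator (a kernel-weighted sum of calibration residuals c_i − r_i) reflects our reading of the paper; the function names and this NumPy implementation are ours, not the authors' released code at the repository linked above.

```python
import numpy as np

def laplacian_kernel(r1, r2, width=0.4):
    # Universal Laplacian kernel k(r, r') = exp(-|r - r'| / width); width 0.4 as in the paper
    return np.exp(-np.abs(r1 - r2) / width)

def mmce_squared(confidences, correct, width=0.4):
    """Batch-level squared MMCE estimate (our reading of the paper's estimator).

    confidences: max softmax probabilities r_i in [0, 1]
    correct:     0/1 indicators c_i (1 if the prediction matched the label)
    """
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    m = r.shape[0]
    residual = c - r                                      # per-sample calibration residual
    K = laplacian_kernel(r[:, None], r[None, :], width)   # m x m kernel matrix
    # (1/m^2) * sum_ij (c_i - r_i)(c_j - r_j) k(r_i, r_j)
    return float(residual @ K @ residual) / (m * m)
```

During training, the paper adds λ times a penalty of this form, computed per mini-batch and differentiable in the confidences, to the NLL loss, with λ chosen by cross-validation as quoted above.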
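
The evaluation metric quoted in the setup row, ECE with 20 equal-width bins of size 0.05, can be computed as in the following sketch. This is the standard ECE definition; the helper name is ours.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=20):
    """ECE with n_bins equal-width confidence bins (size 0.05 for 20 bins)."""
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # include the left edge only for the first bin so every sample falls in exactly one bin
        in_bin = (r > lo) & (r <= hi) if i > 0 else (r >= lo) & (r <= hi)
        if not in_bin.any():
            continue
        gap = abs(c[in_bin].mean() - r[in_bin].mean())  # |accuracy - avg confidence| in the bin
        ece += in_bin.mean() * gap                      # weight by the fraction of samples in the bin
    return ece
```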