Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning
Authors: Jize Zhang, Bhavya Kailkhura, T. Yong-Jin Han
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approaches outperform state-of-the-art solutions on both the calibration as well as the evaluation tasks in most of the experimental settings. Our codes are available at https://github.com/zhang64llnl/Mix-n-Match-Calibration. |
| Researcher Affiliation | Academia | Jize Zhang¹, Bhavya Kailkhura¹, T. Yong-Jin Han¹. ¹Lawrence Livermore National Laboratories, Livermore, CA 994550. Correspondence to: Jize Zhang <EMAIL>. |
| Pseudocode | No | The paper describes procedural steps for algorithms (e.g., IRM in Section 3.3.2) but does not present them in a structured pseudocode block or algorithm figure. |
| Open Source Code | Yes | Our codes are available at https://github.com/zhang64llnl/Mix-n-Match-Calibration. |
| Open Datasets | Yes | We calibrate various deep neural network classifiers on popular computer vision datasets: CIFAR-10/100 (Krizhevsky, 2009) with 10/100 classes and ImageNet (Deng et al., 2009) with 1000 classes. |
| Dataset Splits | Yes | We use 45000 images for training and hold out 15000 images for calibration and evaluation. For ImageNet, we acquired 4 pretrained models from (Paszke et al., 2019) which were trained with 1.3 million images, and 50000 images are hold out for calibration and evaluation. We randomly split the hold-out dataset into n_c = 5000, n_e = 10000 for CIFAR-10/100 and n_c = n_e = 25000 for ImageNet. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. While it references the PyTorch paper, it does not state the PyTorch version or any other software versions used. |
| Experiment Setup | No | The paper states that "The training detail is described in Sec. S6.", indicating that specific experimental setup details such as hyperparameters are not present in the main text. |
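
The random hold-out split quoted under Dataset Splits could be reproduced roughly as follows. This is a minimal sketch, not the authors' code: the function name `split_holdout`, the use of Python's standard `random` module, and the seed value are all assumptions, since the quoted text gives only the split sizes.

```python
import random


def split_holdout(n_holdout, n_cal, n_eval, seed=0):
    """Randomly partition hold-out indices into disjoint calibration
    and evaluation subsets.

    Hypothetical helper: the name and the seed are illustrative
    assumptions; only the sizes come from the quoted paper text.
    """
    if n_cal + n_eval > n_holdout:
        raise ValueError("calibration + evaluation sizes exceed the hold-out size")
    indices = list(range(n_holdout))
    # A seeded generator makes the shuffle repeatable across runs.
    random.Random(seed).shuffle(indices)
    cal_idx = indices[:n_cal]
    eval_idx = indices[n_cal:n_cal + n_eval]
    return cal_idx, eval_idx


# Sizes quoted in the table: CIFAR-10/100 and ImageNet hold-out sets.
cifar_cal, cifar_eval = split_holdout(15000, 5000, 10000)
imagenet_cal, imagenet_eval = split_holdout(50000, 25000, 25000)
```

Fixing the seed makes the split repeatable. The quoted text does not say which seed was used or how many random splits were drawn, so an exact match to the reported numbers should not be expected from this sketch alone.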