Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Asymmetric Duos: Sidekicks Improve Uncertainty

Authors: Tim G. Zhou, Evan Shelhamer, Geoff Pleiss

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across five image classification benchmarks and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only 10–20% more computation. Code is available at: https://github.com/timgzhou/asymmetric-duos.
Researcher Affiliation	Academia	Tim G. Zhou1,2 Evan Shelhamer1,2 Geoff Pleiss1,2 1University of British Columbia 2Vector Institute
Pseudocode	No	The paper describes the methodology and procedures in narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Code is available at: https://github.com/timgzhou/asymmetric-duos.
Open Datasets	Yes	Datasets. We benchmark ImageNet [8] for large-scale evaluation, and Caltech 256 [20] and iWildCam [3] for measuring transfer to new domains. As accuracy and uncertainty quantification tend to degrade under distribution shifts [42], we further evaluate with ImageNet V2 [48] and iWildCam-OOD [27] to test robustness under such shifts.
Dataset Splits	Yes	ImageNet [8] ... We use only 5% of the official test split as our validation set to show how data-efficient the temperature-weighting step is. ... Caltech 256 [20] ... We use a random 15% split of the dataset as val and another 15% as test. ... iWildCam (OOD) [3, 27] ... We use the official IND val split for iWildCam, which contains 7315 images captured by the same set of camera traps as the training set.
Hardware Specification	Yes	All models are trained on NVIDIA L40S GPUs.
Software Dependencies	No	The paper mentions software like 'torchvision [37]', 'timm [59]', and 'AdamW [36]' but does not provide specific version numbers for any of these key software components, which is required for reproducibility.
Experiment Setup	Yes	For all our fine-tuned experiments we follow the LP-FT recipe from Kumar et al. [28]: we first train only the final classification head (also known as Linear Probing, or LP) for a few epochs (8 for both Caltech 256 and iWildCam) using cross-entropy loss and AdamW, then unfreeze the entire network and Fine-Tune (FT) for more epochs (16 on Caltech 256, 12 on iWildCam) under the same loss and optimizer. Our FT procedure employs a sequential schedule (linear warm-up followed by cosine annealing), and we perform hyper-parameter search over learning rate in [1 × 10−6, 3 × 10−4] and weight decay in [1 × 10−8, 1 × 10−5], picking the best trial by validation score. We use a batch size of 128 for Caltech 256 and 16 for iWildCam. All models are trained on NVIDIA L40S GPUs. Random Augmentation [5] is used to augment training samples during the FT phase. The best LP checkpoint initializes FT, and the top FT model by validation performance is carried forward into our Duo evaluations. We calibrate all fine-tuned models by temperature scaling [21], minimizing the negative log likelihood on the validation set using L-BFGS, to improve baseline uncertainty quantification.
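
Illustrative sketch (not taken from the paper or its repository): the Experiment Setup row mentions a learning-rate schedule with linear warm-up followed by cosine annealing, and temperature scaling fitted by minimizing validation negative log likelihood. The standard-library Python below shows one way these two pieces might look. All function names, the warm-up length, the `min_lr` floor, and the search bounds are assumptions made for illustration. The paper reports using L-BFGS for the temperature fit; this sketch swaps in a golden-section search over log-temperature to stay dependency-free, which should reach the same minimizer because the objective has a single minimum in the temperature.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr.

    warmup_steps and min_lr are hypothetical knobs; the source does not
    report their values.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def mean_nll(logits, labels, temperature):
    """Average negative log likelihood of softmax(logits / temperature)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Fit a scalar temperature on held-out logits.

    Golden-section search over log-temperature, used here as a stand-in
    for the L-BFGS optimizer mentioned in the source. The bounds lo and hi
    are assumed, not reported.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if mean_nll(logits, labels, math.exp(c)) < mean_nll(logits, labels, math.exp(d)):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


# Toy validation set: three confident correct predictions and one confident error.
val_logits = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0], [8.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]
temperature = fit_temperature(val_logits, val_labels)
```

On the toy data the fitted temperature is greater than 1, which softens the overconfident predictions. At test time the same scalar would divide a model's logits before the softmax.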