Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bootstrap Your Uncertainty: Adaptive Robust Classification Driven by Optimal-Transport

Authors: Jiawei Huang, Minming Li, Hu Ding

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we present our experimental results across diverse distribution shift scenarios, which demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art robustness. (from abstract) ... 4 Experiments
Researcher Affiliation	Academia	1School of Computer Science and Technology, University of Science and Technology of China 2Department of Computer Science, City University of Hong Kong
Pseudocode	Yes	Algorithm 1 OT Driven Adaptive Distributionally Robust Optimization (AdaDRO) ... Algorithm 2 MLMC-RT Gradient Estimation for Sinkhorn DRO
Open Source Code	Yes	Anonymized code and data are included in the supplemental material.
Open Datasets	Yes	Datasets. We evaluate on three widely studied distribution shift settings: Colored MNIST [2], which tests robustness under spurious correlations; Waterbirds [50], a real-world dataset with strong background-label correlation; CelebA [43], a benchmark for facial attribute recognition. We also evaluate on several long-tailed benchmark datasets: CIFAR-10-LT and CIFAR-100-LT [37].
Dataset Splits	No	The imbalance in these datasets is quantified by the imbalance factor (IF), defined as the ratio between the number of samples in the most frequent class and that in the least frequent class. We evaluate our model under three imbalance levels: IF=10, IF=50, and IF=100, representing increasing levels of class imbalance.
Hardware Specification	Yes	All models are implemented with PyTorch on a single NVIDIA RTX 6000 Ada GPU
Software Dependencies	No	All models are implemented with PyTorch on a single NVIDIA RTX 6000 Ada GPU using the AdamW optimizer [45].
Experiment Setup	Yes	The model is optimized using the AdamW optimizer [45]. For AdaDRO, we use cosine similarity as the kernel in (10), and employ basic augmentations (flip, crop) for semantic calibration in Sec. 3.2 unless otherwise specified.
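
The Dataset Splits entry defines the imbalance factor (IF) as the ratio between the number of samples in the most frequent class and that in the least frequent class. A minimal Python sketch of that definition follows; the function name `imbalance_factor` and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter


def imbalance_factor(labels):
    """Return (largest class count) / (smallest class count).

    Hypothetical helper illustrating the IF definition quoted above;
    it is not part of the paper's released code.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must be non-empty")
    return max(counts.values()) / min(counts.values())


# Toy labels (assumed for illustration): 100 samples of class 0, 10 of class 1.
toy_labels = [0] * 100 + [1] * 10
print(imbalance_factor(toy_labels))  # prints 10.0
```

Under this definition, a perfectly balanced label set has IF = 1, and the IF=10, IF=50, and IF=100 settings named in the table correspond to increasingly skewed class counts.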