Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Bootstrap Your Uncertainty: Adaptive Robust Classification Driven by Optimal-Transport
Authors: Jiawei Huang, Minming Li, Hu Ding
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present our experimental results across diverse distribution shift scenarios, which demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art robustness. (from abstract) ... 4 Experiments |
| Researcher Affiliation | Academia | (1) School of Computer Science and Technology, University of Science and Technology of China; (2) Department of Computer Science, City University of Hong Kong |
| Pseudocode | Yes | Algorithm 1 OT Driven Adaptive Distributionally Robust Optimization (AdaDRO) ... Algorithm 2 MLMC-RT Gradient Estimation for Sinkhorn DRO |
| Open Source Code | Yes | Anonymized code and data are included in the supplemental material. |
| Open Datasets | Yes | Datasets. We evaluate on three widely studied distribution shift settings: Colored MNIST [2], which tests robustness under spurious correlations; Waterbirds [50], a real-world dataset with strong background-label correlation; CelebA [43], a benchmark for facial attribute recognition. We also evaluate on several long-tailed benchmark datasets: CIFAR-10-LT and CIFAR-100-LT [37]. |
| Dataset Splits | No | The imbalance in these datasets is quantified by the imbalance factor (IF), defined as the ratio between the number of samples in the most frequent class and that in the least frequent class. We evaluate our model under three imbalance levels: IF=10, IF=50, and IF=100, representing increasing levels of class imbalance. |
| Hardware Specification | Yes | All models are implemented with PyTorch on a single NVIDIA RTX 6000 Ada GPU |
| Software Dependencies | No | All models are implemented with PyTorch on a single NVIDIA RTX 6000 Ada GPU using the AdamW optimizer [45]. |
| Experiment Setup | Yes | The model is optimized using the AdamW optimizer [45]. For AdaDRO, we use cosine similarity as the kernel in (10), and employ basic augmentations (flip, crop) for semantic calibration in Sec. 3.2 unless otherwise specified. |
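
The Dataset Splits row quotes the paper's definition of the imbalance factor (IF) as the ratio between the sample count of the most frequent class and that of the least frequent class. As a minimal sketch of that definition only (the function name `imbalance_factor` and the toy label list are illustrative assumptions, not taken from the paper or its code), it could be computed like this:

```python
from collections import Counter


def imbalance_factor(labels):
    """Ratio of the largest class count to the smallest class count.

    Hypothetical helper: follows the quoted definition of IF, not the
    authors' implementation.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must be non-empty")
    return max(counts.values()) / min(counts.values())


# Toy labels (made up for illustration): 100, 10, and 2 samples per class.
toy_labels = [0] * 100 + [1] * 10 + [2] * 2
print(imbalance_factor(toy_labels))  # 50.0
```

A perfectly balanced label set gives IF = 1, and the quoted settings IF=10, IF=50, and IF=100 correspond to progressively more skewed class counts.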