Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
What Does It Take to Build a Performant Selective Classifier?
Authors: Stephan Rabanser, Nicolas Papernot
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical validation. Our synthetic and real-world experiments confirm the decomposition: Bayes noise and capacity limits drive large gaps; temperature scaling improves calibration but not ranking; and shift-aware methods remain essential under distribution shift. These results clarify which factors matter most and how to target them effectively in practice. |
| Researcher Affiliation | Academia | Stephan Rabanser, Princeton University; Nicolas Papernot, University of Toronto & Vector Institute |
| Pseudocode | Yes | `LossPredictor( (net): Sequential( (0): Linear(in_features=512, out_features=128, bias=True) (1): ReLU() (2): Dropout(p=0.5) (3): Linear(in_features=128, out_features=64, bias=True) (4): ReLU() (5): Dropout(p=0.5) (6): Linear(in_features=64, out_features=1, bias=True)` |
| Open Source Code | Yes | We include our full experimental suite and details for reproducibility. |
| Open Datasets | Yes | Our synthetic and real-world experiments confirm the decomposition: Bayes noise and capacity limits drive large gaps; temperature scaling improves calibration but not ranking; and shift-aware methods remain essential under distribution shift. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks |
| Dataset Splits | Yes | For each method, we first trained a ResNet-18 on 80% of the training set (using the usual data augmentations and a held-out 20% for LP fitting). |
| Hardware Specification | Yes | Our experiments were conducted on a mix of GPU-equipped compute nodes with varying hardware configurations. Some machines are equipped with Intel Xeon Silver CPUs (10 cores, 20 threads) and 128GB of RAM, each hosting 4 NVIDIA GeForce RTX 2080 Ti GPUs with 11GB VRAM. Others feature AMD EPYC 7643 processors (48 cores, 96 threads), 512GB of RAM, and 4 NVIDIA A100 GPUs, each with 80GB VRAM. |
| Software Dependencies | No | Below is the PyTorch representation of our two-hidden-layer LP head. It takes the ResNet features (optionally concatenated with SEP) and regresses the per-example 0/1 loss via mean-squared error. |
| Experiment Setup | Yes | For each architecture-dataset pair, we use a fixed learning rate, weight decay, and batch size as detailed below. Simple CNN: learning rate 0.01; weight decay 1 × 10⁻⁴; batch size 128. ResNet-18: learning rate 0.1 for CIFAR datasets, 0.01 for Stanford Cars and Camelyon17; weight decay 5 × 10⁻⁴; batch size 128. WideResNet-50-2: same settings as ResNet-18. 200 epochs for all datasets except Camelyon17, which uses 10. Optimization: SGD with momentum 0.9, Nesterov enabled, and a cosine annealing learning rate schedule. |
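
The Pseudocode and Experiment Setup rows can be read together as a small PyTorch sketch. This is a minimal illustration, not the authors' code. The layer sizes, dropout rate, MSE objective, and SGD settings come from the quoted text. The names `LossPredictor` (as a Python class), `make_optimizer`, and the random stand-in features are assumptions. The quoted listing is cut off after the last linear layer, so the sketch assumes no activation follows it. The source reports the optimizer settings for the classifier backbones; reusing them for the LP head here is purely illustrative.

```python
import torch
from torch import nn


class LossPredictor(nn.Module):
    """Two-hidden-layer head regressing the per-example 0/1 loss (sketch)."""

    def __init__(self, in_features: int = 512, p_drop: float = 0.5):
        super().__init__()
        # Layer sizes follow the quoted listing: 512 -> 128 -> 64 -> 1.
        self.net = nn.Sequential(
            nn.Linear(in_features, 128),
            nn.ReLU(),
            nn.Dropout(p=p_drop),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(p=p_drop),
            nn.Linear(64, 1),  # assumption: no output activation follows
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Return one scalar prediction per example.
        return self.net(features).squeeze(-1)


def make_optimizer(model: nn.Module, lr: float = 0.1,
                   weight_decay: float = 5e-4, epochs: int = 200):
    """SGD with Nesterov momentum 0.9 and a cosine annealing schedule.

    Defaults mirror the reported ResNet-18 CIFAR settings. Stepping the
    scheduler once per epoch (T_max=epochs) is an assumption.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                                nesterov=True, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=epochs)
    return optimizer, scheduler


# One illustrative update on random stand-in data (batch size 128).
torch.manual_seed(0)
head = LossPredictor()
optimizer, scheduler = make_optimizer(head)
features = torch.randn(128, 512)                       # stand-in features
zero_one_loss = torch.randint(0, 2, (128,)).float()    # stand-in 0/1 targets

predicted = head(features)
mse = nn.functional.mse_loss(predicted, zero_one_loss)
optimizer.zero_grad()
mse.backward()
optimizer.step()
scheduler.step()  # advances the cosine schedule by one epoch
```

In a real run, the features would come from the trained backbone evaluated on the held-out 20% split, and the targets would be that backbone's per-example 0/1 errors.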