Leveraging Uncertainty Estimates To Improve Classifier Performance

Authors: Gundeep Arora, Srujana Merugu, Anoop Saladi, Rajeev Rastogi

ICLR 2024

Reproducibility assessment. Each item below lists the variable and its result, followed by the supporting LLM response.
Research Type: Experimental
LLM response: "Evaluation of the proposed algorithms on three real-world datasets yields 25%-40% gain in recall at high precision bounds over the traditional approach of using model score alone, highlighting the benefits of leveraging uncertainty."
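For context, "recall at a precision bound" with the model score alone reduces to picking the single best threshold. Below is a minimal sketch of that baseline using scikit-learn; the function name and toy data are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, precision_bound):
    """Best recall over all single score thresholds whose precision
    meets the bound, i.e. the 'model score alone' baseline."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    feasible = precision >= precision_bound
    return float(recall[feasible].max()) if feasible.any() else 0.0

# Toy usage with placeholder labels and scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
print(recall_at_precision(y_true, y_score, precision_bound=0.8))
```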
Researcher Affiliation: Industry
LLM response: "Gundeep Arora, Srujana Merugu, Anoop Saladi, Rajeev Rastogi (Amazon); gundeepa@amazon.com"
Pseudocode: Yes
LLM response: "Algorithm 1 Optimal Equi-weight DP-based Multi-Thresholds [EW-DPMT]"
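Algorithm 1 itself appears in the paper; as a rough illustration of what an equi-weight, DP-based multi-threshold search can look like, here is a minimal Python sketch. It assumes equi-weight score bins within each uncertainty bin and maximizes true positives (hence recall) subject to an aggregate precision bound. The interface (`pos_top`, `bin_size`, `precision_bound`) and the DP formulation are our own illustrative reconstruction, not the paper's exact pseudocode.

```python
def ew_dpmt_sketch(pos_top, bin_size, precision_bound):
    """DP sketch: choose a per-uncertainty-bin score threshold that maximizes
    true positives subject to an overall precision bound.

    pos_top[k][t]: positives captured when the top-t (highest-score)
    equi-weight score bins of uncertainty bin k are accepted, t = 0..L.
    bin_size: samples per equi-weight score bin (constant by construction).
    """
    K, L = len(pos_top), len(pos_top[0]) - 1
    NEG = -1  # sentinel for "unreachable" states
    # f[s]: max true positives with s score bins accepted in total so far.
    f = [0] + [NEG] * (K * L)
    choice = [[0] * (K * L + 1) for _ in range(K)]
    for k in range(K):
        g = [NEG] * (K * L + 1)
        for s in range(K * L + 1):
            for t in range(min(L, s) + 1):
                if f[s - t] != NEG and f[s - t] + pos_top[k][t] > g[s]:
                    g[s] = f[s - t] + pos_top[k][t]
                    choice[k][s] = t
        f = g
    # Among feasible totals, maximize TP subject to TP / (s * bin_size) >= bound.
    best_tp, best_s = 0, 0
    for s in range(1, K * L + 1):
        if f[s] != NEG and f[s] >= precision_bound * s * bin_size and f[s] > best_tp:
            best_tp, best_s = f[s], s
    # Backtrack the number of accepted score bins per uncertainty bin.
    thresholds, s = [0] * K, best_s
    for k in range(K - 1, -1, -1):
        thresholds[k] = choice[k][s]
        s -= thresholds[k]
    return best_tp, thresholds
```

Under the equi-weight assumption, every accepted score bin contributes exactly `bin_size` samples, so precision depends only on the total number of accepted bins; that is what makes the one-dimensional DP state sufficient.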
Open Source Code: No
LLM response: "Moreover, we will publicly open-source our code later, after we clean up our code package and add proper documentation for it."
Open Datasets: Yes
LLM response: "(i) Criteo: An online advertising dataset consisting of 45 MM ad impressions with click outcomes, each with 13 continuous and 26 categorical features. We use the split of 72% : 18% : 10% for train-validation-test from the benchmark. (ii) Avazu: Another CTR prediction dataset comprising 40 MM samples, each with 22 features describing user and ad attributes. We use the train-validation-test splits of 70% : 10% : 20% from the benchmark."
Dataset Splits: Yes
LLM response: "For Criteo and Avazu, we use the SAM architecture (Cheng & Xue, 2021)... For Criteo... we use the split of 72% : 18% : 10% for train-validation-test from the benchmark... (ii) Avazu... the train-validation-test splits of 70% : 10% : 20% from the benchmark... (iii) E-Com... We create train-validation-test sets in the proportion 50% : 12% : 38% from different time periods."
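As a purely illustrative aid (the Criteo/Avazu splits are predefined by the benchmarks, and the E-Com split is by time period rather than random), a fixed-proportion random split could be produced as follows; the helper is hypothetical:

```python
import numpy as np

def proportional_split(n_samples, fractions=(0.72, 0.18, 0.10), seed=0):
    """Return index arrays for a random train/validation/test split.

    Note: this random split is only illustrative; the paper uses the
    benchmark-defined splits (Criteo, Avazu) or a time-based split (E-Com).
    """
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```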
Hardware Specification: Yes
LLM response: "All models were trained on an NVIDIA 16GB V100 GPU. We provide the pseudo-code of binning and all algorithms implemented in Sec. 5 and Appendix G, with details of the bin configuration in Sec. 6.2. All binning and decision-boundary operations were performed on a 4-core machine with an Intel Xeon 2.3 GHz processor (Broadwell E5-2686 v4) running Linux."
Software Dependencies: No
LLM response: "In our implementation, we use the isotonic regression implementation in scikit-learn, which runs in linear time in the input size for L2 loss (Stout, 2013)." The scikit-learn version number is not reported.
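The dependency in question can be exercised in a few lines. Below is a minimal sketch using scikit-learn's `IsotonicRegression` to calibrate raw classifier scores; the scores and labels are placeholders, and the paper does not show its exact calibration calls.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Placeholder raw model scores and binary outcomes on a validation set.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])
labels = np.array([0, 0, 1, 1, 1, 1])

# Fit a monotone (non-decreasing) map from raw score to calibrated
# probability; 'clip' handles test scores outside the training range.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)
calibrated = iso.predict(np.array([0.2, 0.5, 0.95]))
```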
Experiment Setup: Yes
LLM response: "For Criteo and Avazu, we use the SAM architecture (Cheng & Xue, 2021) as the backbone with 1 fully-connected layer and 6 radial flow layers for class distribution estimation. For E-Com, we trained an FT-Transformer (Gorishniy et al., 2021) backbone with 8 radial flow layers. Binning strategies: we consider two options: (i) Equi-span, where the uncertainty and score ranges are divided into K and L equal-sized intervals, respectively... (ii) Equi-weight... Table 1 shows the recall at high precision bounds for various decision-boundary algorithms on three large-scale datasets with 500 score and 3 uncertainty bins, averaged over 5 runs with different seeds."
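To make the two binning strategies concrete, here is a minimal NumPy sketch; the helper names are ours, and the paper's binning pseudocode (Sec. 5 and Appendix G) may differ in details such as tie handling.

```python
import numpy as np

def equi_span_edges(values, n_bins):
    """Equi-span: n_bins intervals of equal width over the value range."""
    return np.linspace(values.min(), values.max(), n_bins + 1)

def equi_weight_edges(values, n_bins):
    """Equi-weight: n_bins intervals holding roughly equal sample counts,
    i.e. edges placed at evenly spaced quantiles."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Example: 500 score bins, matching the configuration quoted above.
scores = np.random.default_rng(0).random(100_000)
score_edges = equi_weight_edges(scores, 500)
bin_ids = np.clip(np.searchsorted(score_edges, scores, side="right") - 1, 0, 499)
```

Equi-weight binning is what makes each score bin contribute the same number of samples, the property the DP sketch after the Pseudocode entry relies on.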