Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Counterfactually Comparing Abstaining Classifiers
Authors: Yo Joong Choe, Aditya Gangrade, Aaditya Ramdas
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is examined in both simulated and real data experiments. ... We present our results in Table 2. ... To illustrate a real data use case, we compare abstaining classifiers on the CIFAR-100 image classification dataset (Krizhevsky, 2009). |
| Researcher Affiliation | Academia | Yo Joong Choe Data Science Institute University of Chicago EMAIL Aditya Gangrade Department of EECS University of Michigan EMAIL Aaditya Ramdas Dept. of Statistics and Data Science Machine Learning Department Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper describes the methods textually and mathematically but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | All code for the experiments is publicly available online at https://github.com/yjchoe/ComparingAbstainingClassifiers. |
| Open Datasets | Yes | To illustrate a real data use case, we compare abstaining classifiers on the CIFAR-100 image classification dataset (Krizhevsky, 2009). |
| Dataset Splits | No | The paper mentions using a 'validation set' for CIFAR-100 but does not provide specific split percentages or sample counts for training, validation, or test sets in the main text. |
| Hardware Specification | No | The paper mentions using 'XSEDE' and the 'Bridges-2 system' at the 'Pittsburgh Supercomputing Center (PSC)', but it does not specify any particular hardware components like CPU or GPU models, or their specifications. |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the nuisance functions, we try linear predictors (L2-regularized linear/logistic regression for µ̂₀/π̂), random forests, and super learners with k-NN, kernel SVM, and random forests. ... use the same softmax output layer but use a different threshold for abstentions. Specifically, both classifiers use the softmax response (SR) thresholding (Geifman and El-Yaniv, 2017), i.e., abstain if max_{c∈Y} f(X)_c < τ for a threshold τ > 0, but A uses a more conservative threshold (τ = 0.8) than B (τ = 0.5). |
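The softmax-response (SR) rule quoted above can be sketched in a few lines: the classifier predicts the argmax class when the top softmax score reaches the threshold τ and abstains otherwise. This is a minimal illustration, not the paper's code; the function names and the sentinel value `-1` for abstention are our own conventions, and the example logits are made up.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sr_predict(logits, tau):
    """Softmax-response (SR) abstention: predict the argmax class if the
    top softmax score is at least tau; otherwise abstain (return -1)."""
    probs = softmax(np.asarray(logits, dtype=float))
    top = probs.max(axis=-1)
    preds = probs.argmax(axis=-1)
    return np.where(top >= tau, preds, -1)

# Two hypothetical inputs: one confident, one borderline.
logits = np.array([[4.0, 0.5, 0.1],    # top softmax score ~0.95
                   [1.2, 0.3, 0.0]])   # top softmax score ~0.59

print(sr_predict(logits, tau=0.8))  # conservative classifier A abstains on row 2
print(sr_predict(logits, tau=0.5))  # classifier B predicts on both rows
```

With the same underlying network, the only difference between the two abstaining classifiers is τ, which is exactly the setup the paper uses on CIFAR-100 (τ = 0.8 for A vs. τ = 0.5 for B).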