Classification with Conceptual Safeguards

Authors: Hailey Joren, Charles Thomas Marx, Berk Ustun

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, each listed with the audit result and the supporting LLM response:
Research Type: Experimental
"We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks." "We present experiments where we benchmark conceptual safeguards on a collection of real-world classification datasets. Our goal is to evaluate their accuracy and coverage trade-offs, and to study the effect of uncertainty propagation and confirmation through ablation studies."
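The responses above mention uncertainty propagation and confirmation. As a minimal sketch (not the authors' implementation), the code below shows one way concept uncertainty could be propagated into a selective prediction, assuming k independent binary concepts and a downstream model f(c) = P(y = 1 | c); the function names, the enumeration strategy, and the abstention rule are all our assumptions.

```python
import itertools
import numpy as np

def propagate_uncertainty(concept_probs, downstream_model):
    """Expected downstream prediction, marginalized over uncertain concepts.

    concept_probs: length-k array of predicted P(concept_j = 1 | x).
    downstream_model: callable mapping a binary concept vector to P(y = 1 | c).
    """
    cp = np.asarray(concept_probs, dtype=float)
    expected = 0.0
    # Enumerate all 2^k binary concept configurations (tractable for small k).
    for config in itertools.product([0, 1], repeat=len(cp)):
        config = np.array(config)
        # P(config | x) under the independence assumption.
        weight = np.prod(np.where(config == 1, cp, 1 - cp))
        expected += weight * downstream_model(config)
    return expected

def predict_or_abstain(concept_probs, downstream_model, threshold=0.1):
    """Predict when confident enough; abstain (return None) otherwise."""
    p = propagate_uncertainty(concept_probs, downstream_model)
    if abs(p - 0.5) < threshold:
        return None  # abstain and route the example for human review
    return int(p >= 0.5)
```

In this sketch, a human-confirmed concept can be handled by setting its probability to exactly 0 or 1 before propagation, which is how confirmation would interact with the abstention decision.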
Researcher Affiliation: Academia
Hailey Joren, UC San Diego, hjoren@ucsd.edu; Charles Marx, Stanford University, ctmarx@stanford.edu; Berk Ustun, UC San Diego, berk@ucsd.edu
Pseudocode: Yes
"Algorithm 1 Greedy Concept Selection"
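The paper's Algorithm 1 is not reproduced here. As a hedged illustration of what a greedy concept-selection loop could look like, the sketch below ranks concepts by the expected gain in prediction confidence from confirming each one, reusing propagate_uncertainty from the sketch above; the scoring rule and all names are our assumptions, not the paper's pseudocode.

```python
import numpy as np

def greedy_concept_selection(concept_probs, downstream_model, budget):
    """Greedily rank concepts to confirm by expected confidence gain.

    At each step, pick the unselected concept whose confirmation, in
    expectation over the model's current belief about that concept, most
    increases the confidence |p - 0.5| of the propagated prediction.
    Returns concept indices in selection order.
    """
    probs = np.asarray(concept_probs, dtype=float)
    base = abs(propagate_uncertainty(probs, downstream_model) - 0.5)
    selected = []
    for _ in range(min(budget, len(probs))):
        best_gain, best_j = -np.inf, None
        for j in range(len(probs)):
            if j in selected:
                continue
            expected_conf = 0.0
            # Average confidence over the two possible confirmed values of
            # concept j, weighted by the model's current probability for j.
            for value, weight in ((1, probs[j]), (0, 1 - probs[j])):
                trial = probs.copy()
                trial[j] = value
                expected_conf += weight * abs(
                    propagate_uncertainty(trial, downstream_model) - 0.5
                )
            if expected_conf - base > best_gain:
                best_gain, best_j = expected_conf - base, j
        selected.append(best_j)
    return selected
```

Note that this sketch only ranks concepts; in deployment the human-supplied value for each selected concept would replace the model's probability before the next selection step.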
Open Source Code: Yes
"We include details on our setup and results in Appendix B, and provide code to reproduce our results on GitHub."
Open Datasets: Yes
"The melanoma and skincancer datasets are image classification tasks to diagnose melanoma and skin cancer derived from the Derm7pt dataset [25]. The warbler and flycatcher datasets are image classification tasks derived from the Caltech-UCSD Birds dataset [26]."
Dataset Splits: Yes
"We split each dataset into a training sample (80%, used to build a selective classification model) and a test sample (20%, used to evaluate coverage and selective accuracy in deployment)."
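For illustration only, a minimal sketch of the quoted 80/20 split using scikit-learn; the placeholder data, random seed, and stratification are our assumptions, not settings reported in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features/labels standing in for one of the paper's datasets.
X = np.random.rand(1000, 16)
y = np.random.randint(0, 2, size=1000)

# 80% training / 20% test, as quoted; seed and stratification are assumed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)
```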
Hardware Specification: No
The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies: No
The paper mentions using a 'standard deep learning algorithm' and 'logistic regression', and refers to 'Inception V3' as a pre-trained model, but does not provide specific software dependencies with version numbers.
Experiment Setup: Yes
"We report the performance of each model through an accuracy-coverage curve as in Fig. 2, which plots its coverage and selective accuracy on the test sample across thresholds. We control the number of examples to confirm by setting a confirmation budget, and plot accuracy-coverage curves for confirmation budgets of 0/10/20/50%." Table 2 lists prediction thresholds of 0.05, 0.1, 0.15, and 0.2.
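For illustration, one way to compute the points of an accuracy-coverage curve at the thresholds listed in Table 2. The confidence rule |p - 0.5| >= threshold and the function name are our assumptions; the paper may define its abstention rule differently.

```python
import numpy as np

def accuracy_coverage_curve(probs, labels, thresholds=(0.05, 0.1, 0.15, 0.2)):
    """Compute (coverage, selective accuracy) pairs across thresholds.

    probs: predicted P(y = 1) on the test sample.
    labels: binary ground-truth labels.
    A prediction is kept only when its confidence |p - 0.5| meets the threshold.
    """
    probs, labels = np.asarray(probs), np.asarray(labels)
    curve = []
    for t in thresholds:
        keep = np.abs(probs - 0.5) >= t
        coverage = keep.mean()
        if keep.any():
            preds = (probs[keep] >= 0.5).astype(int)
            selective_acc = (preds == labels[keep]).mean()
        else:
            selective_acc = float("nan")  # nothing predicted at this threshold
        curve.append((coverage, selective_acc))
    return curve
```

A confirmation budget would enter this sketch upstream, by confirming some fraction of concept predictions before the final probabilities are computed, so the same routine can be applied to each budget's predictions to reproduce a family of curves.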