Classification with Conceptual Safeguards
Authors: Hailey Joren, Charles Thomas Marx, Berk Ustun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks. We present experiments where we benchmark conceptual safeguards on a collection of real-world classification datasets. Our goal is to evaluate their accuracy and coverage trade-offs, and to study the effect of uncertainty propagation and confirmation through ablation studies. *(uncertainty-propagation sketch after the table)* |
| Researcher Affiliation | Academia | Hailey Joren UC San Diego hjoren@ucsd.edu Charles Marx Stanford University ctmarx@stanford.edu Berk Ustun UC San Diego berk@ucsd.edu |
| Pseudocode | Yes | Algorithm 1 Greedy Concept Selection *(hedged sketch after the table)* |
| Open Source Code | Yes | We include details on our setup and results in Appendix B, and provide code to reproduce our results on GitHub. |
| Open Datasets | Yes | The melanoma and skincancer datasets are image classification tasks to diagnose melanoma and skin cancer derived from the Derm7pt dataset [25]. The warbler and flycatcher datasets are image classification tasks derived from the Caltech-UCSD Birds dataset [26]. |
| Dataset Splits | Yes | We split each dataset into a training sample (80%, used to build a selective classification model) and a test sample (20%, used to evaluate coverage and selective accuracy in deployment). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'standard deep learning algorithm' and 'logistic regression', and refers to 'Inception V3' as a pre-trained model, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We report the performance of each model through an accuracy-coverage curve as in Fig. 2, which plots its coverage and selective accuracy on the test sample across thresholds. We control the number of examples to confirm by setting a confirmation budget, and plot accuracy-coverage curves for confirmation budgets of 0/10/20/50%. Table 2, Prediction Thresholds: 0.05, 0.1, 0.15, 0.2. *(evaluation sketch after the table)* |
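
The Research Type row quotes the paper's claim about uncertainty propagation. As a rough illustration of what propagating concept-level uncertainty means in a concept-bottleneck setup, here is a minimal sketch: the label probability is computed by marginalizing over all binary concept configurations. The independence assumption, the logistic aggregator, and all weights are hypothetical, not the authors' implementation.

```python
import itertools

import numpy as np

def propagate_uncertainty(concept_probs, aggregator):
    """Marginalize the label probability over binary concept values.

    concept_probs: P(concept_k = 1 | x) for each concept, assumed independent.
    aggregator: maps a binary concept vector to P(y = 1 | concepts).
    Returns P(y = 1 | x) with concept uncertainty propagated.
    """
    p_y = 0.0
    for bits in itertools.product([0, 1], repeat=len(concept_probs)):
        # probability of this particular concept configuration
        p_c = np.prod([p if b else 1 - p
                       for p, b in zip(concept_probs, bits)])
        p_y += p_c * aggregator(np.array(bits))
    return p_y

# toy logistic aggregator over three concepts (weights are made up)
w, b = np.array([2.0, -1.0, 0.5]), -0.2
aggregator = lambda c: 1 / (1 + np.exp(-(w @ c + b)))

print(propagate_uncertainty(np.array([0.9, 0.4, 0.7]), aggregator))
```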
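
The Pseudocode row cites Algorithm 1 (Greedy Concept Selection) by name only. The sketch below is a hedged reconstruction of a generic greedy loop, reusing `propagate_uncertainty` from the sketch above: at each step, confirm the concept whose confirmation yields the largest expected confidence in the predicted label. The scoring rule and the budget-based stopping criterion are assumptions, not the authors' exact algorithm.

```python
def expected_confidence(concept_probs, aggregator, k):
    """Expected confidence in the predicted label if concept k is confirmed.

    The expectation is over the two possible confirmed values, weighted
    by the concept model's current probability for concept k.
    """
    conf = 0.0
    for value, weight in ((1, concept_probs[k]), (0, 1 - concept_probs[k])):
        probs = concept_probs.copy()
        probs[k] = float(value)             # confirmation removes uncertainty
        p_y = propagate_uncertainty(probs, aggregator)
        conf += weight * max(p_y, 1 - p_y)  # confidence in the argmax label
    return conf

def greedy_concept_selection(concept_probs, aggregator, budget):
    """Greedily pick `budget` concepts to send for human confirmation."""
    selected = []
    for _ in range(budget):
        candidates = [k for k in range(len(concept_probs)) if k not in selected]
        best = max(candidates,
                   key=lambda k: expected_confidence(concept_probs, aggregator, k))
        selected.append(best)
        # in deployment, the confirmed value would come from a human
        # annotator and concept_probs would be updated before the next step
    return selected

print(greedy_concept_selection(np.array([0.9, 0.4, 0.7]), aggregator, budget=2))
```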
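
The Dataset Splits and Experiment Setup rows describe the 80/20 split, the prediction thresholds, and the accuracy-coverage curves. The sketch below shows one plausible way to compute those curves on synthetic data; the abstention rule (predict only when the score falls outside (tau, 1 - tau)) is an assumption about how the thresholds are applied, and the dataset and model are stand-ins for the paper's benchmarks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def coverage_and_selective_accuracy(p_y, y_true, tau):
    """Predict only when p_y is outside (tau, 1 - tau); abstain otherwise."""
    decided = (p_y <= tau) | (p_y >= 1 - tau)
    coverage = decided.mean()
    if not decided.any():
        return coverage, float("nan")
    y_hat = (p_y[decided] >= 0.5).astype(int)
    return coverage, (y_hat == y_true[decided]).mean()

# synthetic stand-in for the paper's datasets
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 80/20 train/test split, as described in the Dataset Splits row
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
p_te = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for tau in (0.05, 0.10, 0.15, 0.20):  # thresholds from the setup row
    cov, acc = coverage_and_selective_accuracy(p_te, y_te, tau)
    print(f"tau={tau:.2f}  coverage={cov:.2f}  selective_acc={acc:.2f}")
```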