Classification with Conceptual Safeguards
Authors: Hailey Joren, Charles Thomas Marx, Berk Ustun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks. We present experiments where we benchmark conceptual safeguards on a collection of real-world classification datasets. Our goal is to evaluate their accuracy and coverage trade-offs, and to study the effect of uncertainty propagation and confirmation through ablation studies. *(uncertainty-propagation sketch after the table)* |
| Researcher Affiliation | Academia | Hailey Joren UC San Diego hjoren@ucsd.edu Charles Marx Stanford University ctmarx@stanford.edu Berk Ustun UC San Diego berk@ucsd.edu |
| Pseudocode | Yes | Algorithm 1 Greedy Concept Selection *(hedged sketch after the table)* |
| Open Source Code | Yes | We include details on our setup and results in Appendix B, and provide code to reproduce our results on GitHub. |
| Open Datasets | Yes | The melanoma and skincancer datasets are image classification tasks to diagnose melanoma and skin cancer derived from the Derm7pt dataset [25]. The warbler and flycatcher datasets are image classification tasks derived from the Caltech-UCSD Birds dataset [26]. |
| Dataset Splits | Yes | We split each dataset into a training sample (80%, used to build a selective classification model) and a test sample (20%, used to evaluate coverage and selective accuracy in deployment). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'standard deep learning algorithm' and 'logistic regression', and refers to 'Inception V3' as a pre-trained model, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We report the performance of each model through an accuracy-coverage curve as in Fig. 2, which plots its coverage and selective accuracy on the test sample across thresholds. We control the number of examples to confirm by setting a confirmation budget, and plot accuracy-coverage curves for confirmation budgets of 0/10/20/50%. Table 2, Prediction Thresholds: 0.05, 0.1, 0.15, 0.2. *(evaluation sketch after the table)* |
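
The Research Type row quotes the paper's claim about uncertainty propagation. As a rough illustration of what propagating concept-level uncertainty means in a concept-bottleneck setup, here is a minimal sketch: the label probability is computed by marginalizing over all binary concept configurations. The independence assumption, the logistic aggregator, and all weights are hypothetical, not the authors' implementation.

```python
import itertools

import numpy as np

def propagate_uncertainty(concept_probs, aggregator):
    """Marginalize the label probability over binary concept values.

    concept_probs: P(concept_k = 1 | x) for each concept, assumed independent.
    aggregator: maps a binary concept vector to P(y = 1 | concepts).
    Returns P(y = 1 | x) with concept uncertainty propagated.
    """
    p_y = 0.0
    for bits in itertools.product([0, 1], repeat=len(concept_probs)):
        # probability of this particular concept configuration
        p_c = np.prod([p if b else 1 - p
                       for p, b in zip(concept_probs, bits)])
        p_y += p_c * aggregator(np.array(bits))
    return p_y

# toy logistic aggregator over three concepts (weights are made up)
w, b = np.array([2.0, -1.0, 0.5]), -0.2
aggregator = lambda c: 1 / (1 + np.exp(-(w @ c + b)))

print(propagate_uncertainty(np.array([0.9, 0.4, 0.7]), aggregator))
```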
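
The Pseudocode row cites Algorithm 1 (Greedy Concept Selection) by name only. The sketch below is a hedged reconstruction of a generic greedy loop, reusing `propagate_uncertainty` from the sketch above: at each step, confirm the concept whose confirmation yields the largest expected confidence in the predicted label. The scoring rule and the budget-based stopping criterion are assumptions, not the authors' exact algorithm.

```python
def expected_confidence(concept_probs, aggregator, k):
    """Expected confidence in the predicted label if concept k is confirmed.

    The expectation is over the two possible confirmed values, weighted
    by the concept model's current probability for concept k.
    """
    conf = 0.0
    for value, weight in ((1, concept_probs[k]), (0, 1 - concept_probs[k])):
        probs = concept_probs.copy()
        probs[k] = float(value)             # confirmation removes uncertainty
        p_y = propagate_uncertainty(probs, aggregator)
        conf += weight * max(p_y, 1 - p_y)  # confidence in the argmax label
    return conf

def greedy_concept_selection(concept_probs, aggregator, budget):
    """Greedily pick `budget` concepts to send for human confirmation."""
    selected = []
    for _ in range(budget):
        candidates = [k for k in range(len(concept_probs)) if k not in selected]
        best = max(candidates,
                   key=lambda k: expected_confidence(concept_probs, aggregator, k))
        selected.append(best)
        # in deployment, the confirmed value would come from a human
        # annotator and concept_probs would be updated before the next step
    return selected

print(greedy_concept_selection(np.array([0.9, 0.4, 0.7]), aggregator, budget=2))
```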
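
The Dataset Splits and Experiment Setup rows describe the 80/20 split, the prediction thresholds, and the accuracy-coverage curves. The sketch below shows one plausible way to compute those curves on synthetic data; the abstention rule (predict only when the score falls outside (tau, 1 - tau)) is an assumption about how the thresholds are applied, and the dataset and model are stand-ins for the paper's benchmarks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def coverage_and_selective_accuracy(p_y, y_true, tau):
    """Predict only when p_y is outside (tau, 1 - tau); abstain otherwise."""
    decided = (p_y <= tau) | (p_y >= 1 - tau)
    coverage = decided.mean()
    if not decided.any():
        return coverage, float("nan")
    y_hat = (p_y[decided] >= 0.5).astype(int)
    return coverage, (y_hat == y_true[decided]).mean()

# synthetic stand-in for the paper's datasets
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 80/20 train/test split, as described in the Dataset Splits row
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
p_te = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for tau in (0.05, 0.10, 0.15, 0.20):  # thresholds from the setup row
    cov, acc = coverage_and_selective_accuracy(p_te, y_te, tau)
    print(f"tau={tau:.2f}  coverage={cov:.2f}  selective_acc={acc:.2f}")
```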