Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Concept Activation Regions: A Generalized Framework For Concept-Based Explanations
Authors: Jonathan Crabbé, Mihaela van der Schaar
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 Experiments. The code to reproduce all the experiments from this section is available at https://github.com/JonathanCrabbe/CARs and https://github.com/vanderschaarlab/CARs. 3.1 Empirical Evaluation. Our purpose is to empirically validate the formalism described in the previous section. We have several independent components to evaluate: (1) the concept classifier used to detect the CARs Hc, (2) the global explanations induced by the TCAR values and (3) the feature importance scores induced by the concept densities c. Datasets. We perform our experiments on 3 datasets. |
| Researcher Affiliation | Academia | Jonathan Crabbé University of Cambridge EMAIL Mihaela van der Schaar University of Cambridge The Alan Turing Institute UCLA EMAIL |
| Pseudocode | Yes | The implementation of our method closely follows Algorithms 1, 2 and 4 in the appendices. |
| Open Source Code | Yes | The code to reproduce all the experiments from this section is available at https://github.com/JonathanCrabbe/CARs and https://github.com/vanderschaarlab/CARs. |
| Open Datasets | Yes | Datasets. We perform our experiments on 3 datasets. (1) The MNIST dataset [56]... (2) The MIT-BIH Electrocardiogram (ECG) dataset [57, 58]... (3) The Caltech-UCSD Birds-200 (CUB) dataset [59]... We use the data collected with the Surveillance, Epidemiology, and End Results (SEER) Program. The dataset [69]... |
| Dataset Splits | Yes | We train a multilayer perceptron (MLP) to predict the patient's mortality on 90% of the data and test on the remaining 10%. The classifier is then evaluated by computing its accuracy on a holdout balanced concept set T c of size 100 sampled from the model's testing set. |
| Hardware Specification | No | Our computing resources are described in Appendices E and F. These appendices are not included in the provided text, so specific hardware details are not explicitly described in the main body. |
| Software Dependencies | No | The paper mentions using Python, Scikit-learn, PyTorch, and Captum, and refers to software dependencies being described in Appendices E and F. However, the provided text does not specify version numbers for these software components. For example, [71] cites 'Scikit-learn: Machine learning in Python' and [72] cites 'Captum: A unified and generic model interpretability library for PyTorch' without specific version numbers for Scikit-learn, PyTorch, or Captum. |
| Experiment Setup | Yes | For several of those latent spaces, we fit our CAR classifier (SVC with radial basis function kernel) to discriminate the concept sets Pc, N c for each concept c ∈ [C]. These two sets have a size N c = 200 and are sampled from the model's training set. We train a multilayer perceptron (MLP)... We fit a CAR classifier (SVC with linear kernel) to discriminate the concept sets Pc, N c for each grade c ∈ [5]. Both of those sets contain N c = 250 patients sampled from the training set. |
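
The Experiment Setup row describes fitting a support vector classifier on latent representations of concept positives and negatives. A minimal sketch of that step with scikit-learn follows. It is not the authors' code: the latent vectors are synthetic Gaussian stand-ins, `latent_dim`, `sample_latents`, and the class shifts are hypothetical, and the 50/50 composition of the size-100 holdout set is an assumption about what "balanced" means here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical latent dimension; in the paper the representations come from
# a hidden layer of the trained model, not from a random generator.
latent_dim = 16
n_per_set = 200  # size of the positive and negative sets quoted above


def sample_latents(n, shift):
    """Draw n synthetic latent vectors centred at `shift` (illustrative only)."""
    return rng.normal(loc=shift, scale=1.0, size=(n, latent_dim))


# Concept positives (label 1) and negatives (label 0) for one concept.
X_train = np.vstack([sample_latents(n_per_set, 1.0),
                     sample_latents(n_per_set, -1.0)])
y_train = np.concatenate([np.ones(n_per_set, dtype=int),
                          np.zeros(n_per_set, dtype=int)])

# SVC with a radial basis function kernel, as quoted for the first experiments;
# kernel="linear" would correspond to the SEER setup quoted in the same row.
car_classifier = SVC(kernel="rbf")
car_classifier.fit(X_train, y_train)

# Balanced holdout set of size 100 (assumed: 50 positives, 50 negatives).
X_test = np.vstack([sample_latents(50, 1.0), sample_latents(50, -1.0)])
y_test = np.concatenate([np.ones(50, dtype=int), np.zeros(50, dtype=int)])

holdout_accuracy = car_classifier.score(X_test, y_test)
print(f"Holdout accuracy: {holdout_accuracy:.2f}")
```

In the actual experiments one such classifier would be fitted per concept and per latent space; the repositories linked in the table are the reference for the real pipeline.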