Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

Authors: Sonia Laguna, Ričards Marcinkevičs, Moritz Vandenhirtz, Julia Vogt

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks."
Researcher Affiliation | Academia | "Sonia Laguna, Ričards Marcinkevičs, Moritz Vandenhirtz, Julia E. Vogt, Department of Computer Science, ETH Zurich, Switzerland"
Pseudocode | Yes | "Pseudocode implementation can be found as part of Algorithm B.1 in Appendix B." (an illustrative intervention sketch follows the table)
Open Source Code | Yes | "The code is available in a repository at https://github.com/sonialagunac/Beyond-CBM."
Open Datasets | Yes | "Datasets. We evaluate the proposed methods on synthetic and real-world benchmarks summarised in Table D.1 (Appendix D). For controlled experiments, we have adapted the nonlinear synthetic tabular dataset from Marcinkevičs et al. (2024). Another benchmark we consider is the Animals with Attributes 2 (AwA2) natural image dataset (Lampert et al., 2009; Xian et al., 2019)... Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011)... CIFAR-10 (Krizhevsky et al., 2009) and the large-scale ImageNet (Russakovsky et al., 2015)... MIMIC-CXR (Johnson et al., 2019) and CheXpert (Irvin et al., 2019) datasets..."
Dataset Splits | Yes | "Unless mentioned otherwise, we mainly focus on the simplest scenario shown in Figure D.1(a). Below, we outline each generative process in detail. Throughout this appendix, let N, p, and K denote the number of independent data points {(x_i, c_i, y_i)}_{i=1}^N, covariates, and concepts, respectively. Across all experiments, we set N = 50,000, p = 1,500, and K = 30. This dataset was divided according to the 60%-20%-20% train-validation-test split." (for Synthetic) "This dataset was divided according to the 60%-20%-20% train-validation-test split." (for AwA2) "To generate the validation set, we randomly hold out 10,000 images from the training data to remain faithful to the original test set." (for CIFAR-10) "In our experiments, we allocate half of the validation as the test set." (for ImageNet) "Both chest radiograph datasets are divided according to the 80%-10%-10% train-validation-test split." (for CheXpert and MIMIC-CXR) (a split sketch follows the table)
Hardware Specification | Yes | "We run the reported experiments on a cluster of GeForce RTX 2080 GPUs with a single CPU worker."
Software Dependencies | Yes | "All methods were implemented using PyTorch (v1.12.1) (Paszke et al., 2019) and scikit-learn (v1.0.2) (Pedregosa et al., 2011)." (a version-check snippet follows the table)
Experiment Setup | Yes | "For the synthetic data, CBMs and black-box classifiers are trained for 150 and 100 epochs, respectively, with a learning rate of 10^-4 and a batch size of 64. Across all other experiments, CBMs are trained for 350 epochs and black-box models for 300 epochs with a learning rate of 10^-4, halved midway through training, and a batch size of 64. All probes were trained on the validation data for 150 epochs with a learning rate of 10^-2 using the stochastic gradient descent (SGD) optimiser. Finally, all fine-tuning procedures were run for 150 epochs with a learning rate of 10^-4 and a batch size of 64 using the Adam optimiser. ImageNet is an exception to the above configurations due to its large size. The black-box models in this dataset were trained for 60 epochs, and the probes and fine-tuning procedures for 20 epochs. At test time, interventions were performed on batches of 512 data points." (an optimiser/scheduler sketch follows the table)
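
To give a flavour of the pseudocode referenced above (Algorithm B.1 in the paper's Appendix B), here is a minimal, hypothetical sketch of a concept-based intervention on a black box, assuming the network is split into an encoder and a prediction head with a separately trained concept probe. The names `encoder`, `head`, `probe`, and `c_target`, as well as the optimisation-based activation edit, are illustrative assumptions, not the authors' exact procedure; consult Appendix B of the paper for the real algorithm.

```python
import torch
import torch.nn.functional as F

def intervene(encoder, head, probe, x, c_target, steps=100, lr=0.1, lam=1.0):
    """Edit intermediate activations so the concept probe matches the desired
    concepts, then re-run the prediction head on the edited activations.

    encoder, head: the black box split as prediction = head(encoder(x))
    probe: maps activations z to concept logits (trained on validation data)
    c_target: desired concept values in [0, 1], float tensor of shape (batch, K)
    lam: trade-off between staying close to the original z and fitting c_target
    """
    with torch.no_grad():
        z0 = encoder(x)                      # original activations
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        consistency = F.mse_loss(z, z0)      # stay close to the original representation
        concept_fit = F.binary_cross_entropy_with_logits(probe(z), c_target)
        (lam * consistency + concept_fit).backward()
        opt.step()
    with torch.no_grad():
        return head(z)                       # prediction after the intervention
```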
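The repeated 60%-20%-20% split from the Dataset Splits row is straightforward to reproduce with two chained calls to scikit-learn's train_test_split; the placeholder data, variable names, and seed below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the stated dimensions: N = 50,000 points, p = 1,500 covariates.
X = np.random.randn(50_000, 1_500)
y = np.random.randint(0, 2, size=50_000)

# Hold out 40% for validation + test, then split that part in half,
# yielding the 60%-20%-20% train-validation-test split described above.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```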
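Because the reported versions predate several breaking PyTorch and scikit-learn releases, a quick check of the installed stack can save a failed reproduction run; the assertion style below is merely a suggested convention.

```python
import sklearn
import torch

# Verify the environment matches the versions the authors report.
assert torch.__version__.startswith("1.12.1"), f"unexpected torch {torch.__version__}"
assert sklearn.__version__.startswith("1.0.2"), f"unexpected scikit-learn {sklearn.__version__}"
```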
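Finally, the schedule "learning rate of 10^-4 halved midway through training" from the Experiment Setup row maps naturally onto PyTorch's MultiStepLR. The sketch below shows one plausible wiring of the stated configuration; the stand-in modules, layer sizes, and the choice of Adam for the black box are assumptions, not the authors' architecture.

```python
import torch

# Stand-in modules; the paper uses task-specific black boxes and probes.
model = torch.nn.Linear(1_500, 10)   # black-box classifier placeholder
probe = torch.nn.Linear(512, 30)     # concept probe placeholder (K = 30 concepts)

# Optimiser choice for the black box is an assumption; the paper only states the rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate midway through the 300 black-box training epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.5)
# Probes are trained separately with SGD at a learning rate of 10^-2.
probe_optimizer = torch.optim.SGD(probe.parameters(), lr=1e-2)

for epoch in range(300):
    # ... one pass over the training set with batch size 64 goes here ...
    scheduler.step()
```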