Learning to Intervene on Concept Bottlenecks
Authors: David Steinmann, Wolfgang Stammer, Felix Friedrich, Kristian Kersting
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the full potential of CB2M in our experimental evaluations on several challenging tasks, such as handling distribution shifts and confounding factors across several datasets. |
| Researcher Affiliation | Academia | (1) Artificial Intelligence and Machine Learning Group, TU Darmstadt, Germany; (2) Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany; (3) Centre for Cognitive Science, TU Darmstadt, Germany; (4) German Center for Artificial Intelligence (DFKI). |
| Pseudocode | Yes | For reference, we present algorithms with pseudo-code for mistake detection (Alg. 1) and intervention generalization (Alg. 2). (An illustrative sketch of both routines is given below the table.) |
| Open Source Code | Yes | Code is available at: https://github.com/ml-research/CB2M |
| Open Datasets | Yes | Data: The Caltech-UCSD Birds (CUB) dataset (Wah et al., 2011) consists of roughly 12 000 images of 200 bird classes. We use the data splits provided by Koh et al. (2020), resulting in training, validation, and test sets with 40, 10, and 50% of the total images. Additionally, we add 4 training and validation folds to perform 5-fold validation. Images in the dataset are annotated with 312 concepts (e.g., beak-color:black, beak-color:brown, etc.), which can be grouped into concept groups (one group for all beak-color: concepts). We follow the approach of previous work (Koh et al., 2020; Chauhan et al., 2022) and use only concepts that occur for at least 10 classes and then perform majority voting on the concept values for each class. This results in 112 concepts from 28 groups. We also include experiments on a new, confounded version of CUB, denoted as CUB (conf.). We further provide evidence based on the MNIST (LeCun & Cortes, 1998), confounded Color MNIST (C-MNIST) (Rieger et al., 2020), and SVHN (Netzer et al., 2011) datasets. (A sketch of the concept preprocessing is given below the table.) |
| Dataset Splits | Yes | We use the data splits provided by Koh et al. (2020), resulting in training, validation, and test sets with 40, 10, and 50% of the total images. Additionally, we add 4 training and validation folds to perform 5-fold validation. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions models like Inception-v3 and MLPs, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For CUB, we use the same model setup as Koh et al. (2020), instantiating the bottleneck model with the Inception-v3 architecture (Szegedy et al., 2016) and the predictor network with a simple multi-layer perceptron (MLP). On the Parity MNIST, SVHN, and C-MNIST datasets, we used an MLP both for the bottleneck and predictor networks. The bottleneck is a two-layer MLP with a hidden dimension of 120 and ReLU activation functions, while the predictor is a single-layer MLP. The bottlenecks are trained using the specific dataset's respective training and validation sets. ... CB2M parameters are tuned for generalization and detection separately on the training and validation set (cf. App. A.8). For all detection experiments, the memory of CB2M is filled with wrongly classified instances of the validation set according to the parameters. For generalization experiments, we simulate human interventions on the validation set and use CB2M to generalize them to the test set. ... The detailed hyperparameters for each setup can be found in Tab. 13. For further training setup, e.g., learning rates, we refer to the code. (A minimal sketch of the MLP bottleneck/predictor setup is given below the table.) |
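The Pseudocode row refers to the paper's Alg. 1 (mistake detection) and Alg. 2 (intervention generalization). The sketch below is only a minimal illustration of the two-fold memory idea, not the authors' implementation: the class name `CB2MSketch`, the neighbour count `k`, the distance threshold `tau`, and the Euclidean distance are our own illustrative choices; the authoritative versions are Alg. 1/2 in the paper and the released code.

```python
import numpy as np

class CB2MSketch:
    """Illustrative memory of known mistakes (bottleneck encodings) and,
    optionally, the human interventions that corrected them."""

    def __init__(self, k=3, tau=5.0):
        self.k = k            # number of memorised mistakes that must be close
        self.tau = tau        # distance threshold on bottleneck encodings
        self.encodings = []   # encodings of misclassified validation instances
        self.interventions = []  # corrected concept vectors (or None)

    def add_mistake(self, encoding, intervention=None):
        self.encodings.append(np.asarray(encoding, dtype=float))
        self.interventions.append(intervention)

    def _distances(self, encoding):
        mem = np.stack(self.encodings)
        return np.linalg.norm(mem - np.asarray(encoding, dtype=float), axis=1)

    def detect(self, encoding):
        """Flag a new input as a likely mistake if its k nearest memorised
        mistakes are all within distance tau (cf. Alg. 1 in the paper)."""
        if len(self.encodings) < self.k:
            return False
        nearest = np.sort(self._distances(encoding))[: self.k]
        return bool(np.all(nearest <= self.tau))

    def generalize(self, encoding):
        """Reuse the intervention stored with the closest memorised mistake
        (cf. Alg. 2 in the paper); returns None if detection does not fire."""
        if not self.detect(encoding):
            return None
        nearest = int(np.argmin(self._distances(encoding)))
        return self.interventions[nearest]
```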
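The Open Datasets row quotes the CUB preprocessing of Koh et al. (2020): majority-vote the concept annotations per class and keep only concepts that occur for at least 10 classes (yielding 112 concepts from 28 groups). The data splits themselves are reused from Koh et al. (2020) rather than recomputed. A rough sketch of the voting/filtering step, assuming binary per-image concept annotations in a NumPy array; the function name, array layout, and exact filtering rule are our assumptions:

```python
import numpy as np

def preprocess_concepts(concept_labels, class_labels, min_classes=10):
    """concept_labels: (n_images, n_concepts) binary array of per-image annotations.
    class_labels: (n_images,) integer class ids.
    Returns class-level concept targets after majority voting and filtering."""
    classes = np.unique(class_labels)
    # Majority vote of each concept within each class.
    class_concepts = np.stack([
        (concept_labels[class_labels == c].mean(axis=0) >= 0.5).astype(int)
        for c in classes
    ])
    # Keep only concepts that remain active for at least `min_classes` classes.
    keep = class_concepts.sum(axis=0) >= min_classes
    return class_concepts[:, keep], keep
```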
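The Experiment Setup row reports that on Parity MNIST, SVHN, and C-MNIST the bottleneck is a two-layer MLP with hidden dimension 120 and ReLU activations and the predictor is a single-layer MLP (CUB instead uses Inception-v3 for the bottleneck). A minimal PyTorch sketch of that MLP setup; the input, concept, and class dimensions below are placeholders, and training details such as learning rates are only given in the authors' code:

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Two-layer MLP bottleneck (hidden dim 120, ReLU) mapping inputs to concept logits."""
    def __init__(self, in_dim=784, n_concepts=10, hidden=120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_concepts),
        )

    def forward(self, x):
        return self.net(x)

class PredictorMLP(nn.Module):
    """Single-layer MLP mapping (possibly intervened) concepts to task labels."""
    def __init__(self, n_concepts=10, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(n_concepts, n_classes)

    def forward(self, c):
        return self.fc(c)

# Usage: predicted concepts can be overwritten by human interventions
# before being passed to the predictor.
bottleneck, predictor = BottleneckMLP(), PredictorMLP()
x = torch.randn(4, 784)
concepts = torch.sigmoid(bottleneck(x))
logits = predictor(concepts)
```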