A Closer Look at the Intervention Procedure of Concept Bottleneck Models

Authors: Sungbin Shin, Yohan Jo, Sungsoo Ahn, Namhoon Lee

ICML 2023

Each entry below pairs a reproducibility variable with its assessed result and the supporting LLM response (quoted from the paper where applicable):
Research Type: Experimental. "We verify our findings through comprehensive evaluations, not only on the standard real datasets, but also on synthetic datasets that we generate based on a set of different causal graphs. We experiment with three datasets: (1) CUB (Wah et al., 2011), the standard dataset used to study CBMs; (2) SkinCon (Daneshjou et al., 2022b), a medical dataset used to build interpretable models; and (3) Synthetic, the synthetic datasets we generate based on different causal graphs to conduct a wide range of controlled experiments."
Researcher Affiliation: Collaboration. "POSTECH, South Korea; Amazon Alexa AI, USA."
Pseudocode: Yes. "Algorithm 1: Generating synthetic data"
Open Source Code: Yes. "Our code is available at https://github.com/ssbin4/Closer-Intervention-CBM."
Open Datasets: Yes. "CUB (Wah et al., 2011) is the standard dataset used to study CBMs in previous works (Koh et al., 2020; Zarlenga et al., 2022; Havasi et al., 2022; Sawada & Nakamura, 2022). SkinCon (Daneshjou et al., 2022b) is a medical dataset which can be used to build interpretable machine learning models."
Dataset Splits: Yes. "Since training and test sets are not specified in the SkinCon dataset, we randomly split the dataset into 70%, 15%, and 15% for training, validation, and test sets, respectively. We randomly divide the generated examples into 70% training, 15% validation, and 15% test sets."
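As a rough illustration of the reported 70/15/15 random split, here is a minimal sketch; the use of scikit-learn's train_test_split and the helper name split_70_15_15 are assumptions for illustration, not taken from the paper's released code.

```python
# Minimal sketch of a 70/15/15 random split (assumed tooling: scikit-learn).
from sklearn.model_selection import train_test_split

def split_70_15_15(examples, seed=0):  # hypothetical helper name
    # Carve off 70% for training, then halve the remaining 30%
    # into 15% validation and 15% test.
    train, rest = train_test_split(examples, train_size=0.70, random_state=seed)
    val, test = train_test_split(rest, test_size=0.50, random_state=seed)
    return train, val, test
```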
Hardware Specification: Yes. "τi ≈ 0.7, τg ≈ 18.7 × 10⁻³, and τf ≈ 0.03 × 10⁻³ are acquired by measuring the inference time with an RTX 3090 GPU and taking the average of 300 repetitions."
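A minimal sketch of how per-model inference latency could be averaged over 300 repetitions as reported; the PyTorch timing loop, warm-up count, and placeholder names model and inputs are assumptions, since the paper's measurement script is not quoted here.

```python
# Sketch: average GPU inference latency over 300 repetitions (assumes PyTorch
# and a CUDA device such as the RTX 3090 mentioned above).
import time
import torch

def mean_inference_time(model, inputs, repeats=300):
    model.eval()
    with torch.no_grad():
        for _ in range(10):              # warm-up runs, excluded from timing
            model(inputs)
        torch.cuda.synchronize()         # flush queued CUDA work before timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(inputs)
        torch.cuda.synchronize()         # wait for all repetitions to finish
    return (time.perf_counter() - start) / repeats
```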
Software Dependencies: No. The paper mentions using Inception-v3, but it does not specify software versions for the libraries or frameworks used (e.g., Python or PyTorch versions) beyond the model architecture.
Experiment Setup: Yes. "We used λ = 0.01 for JNT and JNT+P, whose values were directly taken from Koh et al. (2020). For the experiments without majority voting (Figure 30 in Appendix H), we use Inception-v3 pretrained on ImageNet for g and a 2-layer MLP for f with a dimensionality of 200 so that it can describe more complex functions. We searched for the best hyperparameters for both g and f over the same sets of values as in Koh et al. (2020). Specifically, we tried initial learning rates of [0.01, 0.001], a constant learning rate or decaying the learning rate by 0.1 every [10, 15, 20] epochs, and weight decay of [0.0004, 0.00004]."
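To make the quoted search space concrete, the sketch below enumerates it as a grid; the dictionary layout and schedule labels are illustrative assumptions, and only the values (initial learning rates [0.01, 0.001], constant or step decay by 0.1 every [10, 15, 20] epochs, weight decay [0.0004, 0.00004]) come from the quote.

```python
# Sketch of the reported hyperparameter grid (values from the quote above;
# the enumeration style is an assumption).
from itertools import product

initial_lrs   = [0.01, 0.001]
schedules     = ["constant"] + [f"decay_0.1_every_{e}_epochs" for e in (10, 15, 20)]
weight_decays = [0.0004, 0.00004]

grid = [
    {"lr": lr, "schedule": sched, "weight_decay": wd}
    for lr, sched, wd in product(initial_lrs, schedules, weight_decays)
]
print(len(grid))  # 2 * 4 * 2 = 16 configurations, searched for both g and f
```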