Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
[Re] On the Reproducibility of Post-Hoc Concept Bottleneck Models
Authors: Nesta Midavaine, Gregory Hok Tjoan Go, Diego Canez, Ioana Simion, Satchit Chatterji
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we reproduce and expand upon the findings of Yuksekgonul et al. (2023), showing that while their claims and results do generally hold, some of them could not be sufficiently replicated. Specifically, the claims relating to PCBM performance preservation and its non-requirement of labeled concept datasets were generally reproduced, whereas the one claiming its model editing capabilities was not. Beyond these results, our contributions to their work include evidence that PCBMs may work for audio classification problems, verification of the interpretability of their methods, and updates to their code for missing implementations. |
| Researcher Affiliation | Academia | Graduate School of Informatics, University of Amsterdam |
| Pseudocode | No | The paper describes methods and procedures in text but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The code for our implementations can be found in https://github.com/dgcnz/FACT. |
| Open Datasets | Yes | In total, the original authors used seven different datasets for experimentation, either to evaluate the performance of PCBMs across different domains, the quality of generated CLIP concepts, or the results of global model editing. All datasets used for binary classification were evaluated using the area under the curve (AUC), the multi-label COCO-Stuff using mAP, and the rest using accuracy. An overview of each dataset and its purpose can be found in Table 8. For COCO-Stuff and SIIM-ISIC, we followed the original paper to create subsets of each in order to reduce the disk space required for experimentation; the specifications for how they were created can be found in our repository. (Footnote 1: The trimmed-down datasets can be found here: COCO-Stuff, SIIM-ISIC.) For the model editing experiments and the survey, multiple datasets were generated using Metashift with the Visual Genome dataset. (Footnote 2: The generated datasets can be found here: Model editing, Survey.) As part of the audio classification extension, the ESC-50, UrbanSound8K, and AudioSet datasets were utilized (Salamon et al., 2014; Piczak, 2015). |
| Dataset Splits | Yes | For the SIIM-ISIC experiments, we implemented our data selection method based on the limited details provided by the authors, which state that they utilized 2000 images (400 malignant, 1600 benign) for training and a held-out set of 500 images (100 malignant, 400 benign) for model evaluation. For the model editing experiments, the training dataset consisted of 100 samples of a class with a spurious correlation, while the test dataset comprised 100 samples of the same class correlated with any concept except the one used in training. |
| Hardware Specification | Yes | All of our experiments were conducted using Google Colab in region europe-west4, which has a carbon efficiency of 0.57 kgCO2eq/kWh. However, most experiments were CPU-based: because pre-trained models were used for the backbones, almost all training and evaluation involved only the PCBM components. Only the PCBM-h instances required GPU computation, as they are neural networks. We utilized a T4 GPU and an Intel(R) Xeon(R) CPU for these experiments, resulting in a total computational cost of roughly 30 CPU and 30 GPU hours for all experiments. |
| Software Dependencies | No | The paper mentions "bash and python scripts" and "Jupyter Notebooks" but does not specify any version numbers for these or other software libraries or dependencies. |
| Experiment Setup | Yes | For a comparable replication, we used the same hyperparameters as the original paper whenever they were specified. This was the case for everything apart from the regularization parameters C_SVM and λ_logreg. C_SVM is used by the SVM for CAV computation; the open-source repository supplies the majority of the necessary code, including an example grid for fine-tuning C values: [0.001, 0.01, 0.1, 1.0, 10.0]. Meanwhile, λ_logreg is employed when investigating the original models for CIFAR10, CIFAR100, and COCO-Stuff. The original model for these three datasets is CLIP-ResNet50, so we determined the hyperparameter in the same way as Radford et al. (2021): a hyperparameter sweep on the validation sets over the range 10^-6 to 10^6, with 96 logarithmically spaced steps. After testing many configurations, we decided to proceed using λ = 1.7, an L1 ratio of α = 0.99, BRODEN concepts, a CLIP-ResNet50 encoder, and the SGD optimizer. We experimented with the regularization and settled on λ = 0.002, which made our results consistent with the original. For UrbanSound8K, we set λ_CLIP = 2·10^-4 and λ_CAV = 1/(34·10). Similarly, for ESC-50, λ_CLIP = 2·10^-10 and λ_CAV = 1/(171·50). Early stopping was applied to the ESC-50 models due to observed overfitting. |
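The regularization sweep described in the experiment setup (96 logarithmically spaced values between 10^-6 and 10^6, scored on a validation set, following Radford et al., 2021) can be sketched as below. This is a minimal illustration with scikit-learn on synthetic toy data, not the paper's actual code; the dataset and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 96 logarithmically spaced regularization candidates between 1e-6 and 1e6,
# matching the sweep described for the linear probes on CLIP features.
lambdas = np.logspace(-6, 6, num=96)

# Toy stand-in data; the paper sweeps on the real validation sets.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_lam, best_acc = None, -1.0
for lam in lambdas:
    # scikit-learn parameterizes L2 strength inversely: C = 1 / lambda.
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_lam, best_acc = lam, acc

print(f"best lambda: {best_lam:.3g}, validation accuracy: {best_acc:.3f}")
```

With 96 logarithmic steps the grid covers twelve orders of magnitude at roughly 1.33× spacing, which is why a single coarse sweep suffices without a second refinement pass.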