Concept Bottleneck Models

Authors: Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate concept bottleneck models on the two applications in Figure 1: the osteoarthritis grading task (Nevitt et al., 2006) and a fine-grained bird species identification task (Wah et al., 2011). On these tasks, we show that bottleneck models are competitive with standard end-to-end models while also attaining high concept accuracies.
Researcher Affiliation | Collaboration | 1Stanford University, 2Google Research.
Pseudocode | No | The paper describes the different bottleneck models and training schemes in prose but does not include any formal pseudocode blocks or algorithms.
Open Source Code | Yes | The code for replicating our experiments is available on GitHub at https://github.com/yewsiang/ConceptBottleneck.
Open Datasets | Yes | We use knee x-rays from the Osteoarthritis Initiative (OAI) (Nevitt et al., 2006)... We use the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011)... While we are unable to release the OAI dataset publicly, an application to access the data can be made at https://nda.nih.gov/oai/.
Dataset Splits | Yes | For the OAI task... We use a 70/10/20 train/validation/test split. The CUB dataset... We use the standard 60/40 train/test split, and randomly set aside 20% of the training images as the validation set. (From Appendix A)
Hardware Specification | No | The paper describes the models and datasets used but does not provide specific details about the hardware (e.g., GPU or CPU models) used for training or experimentation.
Software Dependencies | No | The paper mentions deep learning models (ResNet-18, Inception-v3) and optimizers (Adam) but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the joint bottleneck model, we search over the task-concept tradeoff hyperparameter λ and report results for the model that has the highest task accuracy while maintaining high concept accuracy on the validation set (λ = 1 for OAI and λ = 0.01 for CUB). We model x-ray grading as a regression problem (minimizing mean squared error)... we finetune a pretrained ResNet-18 model... and use a small 3-layer MLP for c → y. (From Appendix B: We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4, batch size of 32, and train for 50 epochs.)
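
To make the quoted configuration concrete, below is a minimal PyTorch sketch of the jointly trained concept bottleneck described in the Experiment Setup row: a pretrained backbone predicts concepts (x → c), a small 3-layer MLP maps predicted concepts to the task label (c → y), and the two losses are combined with the tradeoff weight λ. This is an illustration, not the authors' released code: NUM_CONCEPTS, NUM_CLASSES, and train_loader are placeholders, the backbone and loss choices differ between OAI (ResNet-18, MSE regression) and CUB (Inception-v3, classification), and the hyperparameters simply echo the values quoted above (Adam, learning rate 1e-4, batch size 32, 50 epochs).

```python
# Hedged sketch of a jointly trained concept bottleneck model.
# Assumptions: CUB-style setup (binary concepts, multi-class label);
# the placeholder names below are not taken from the authors' repository.
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 112   # number of concept annotations (placeholder)
NUM_CLASSES = 200    # number of task classes (placeholder)
LAMBDA = 0.01        # task-concept tradeoff reported for CUB

class ConceptBottleneck(nn.Module):
    def __init__(self, num_concepts, num_classes):
        super().__init__()
        # x -> c: finetune a pretrained backbone to predict concept logits
        self.backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_concepts)
        # c -> y: small 3-layer MLP from predicted concepts to the task label
        self.head = nn.Sequential(
            nn.Linear(num_concepts, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        c_logits = self.backbone(x)     # predicted concepts (the bottleneck)
        y_logits = self.head(c_logits)  # task prediction made from concepts only
        return c_logits, y_logits

model = ConceptBottleneck(NUM_CONCEPTS, NUM_CLASSES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
task_loss = nn.CrossEntropyLoss()       # task loss (MSE would be used for OAI)
concept_loss = nn.BCEWithLogitsLoss()   # per-concept loss for binary concepts

for epoch in range(50):                 # 50 epochs, as quoted from Appendix B
    for x, c, y in train_loader:        # placeholder DataLoader, batch size 32
        c_logits, y_logits = model(x)
        # joint objective: L_task + lambda * L_concept
        loss = task_loss(y_logits, y) + LAMBDA * concept_loss(c_logits, c.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The key design point the sketch illustrates is that the task head sees only the predicted concepts, so λ controls how strongly training pushes those intermediate predictions to match the annotated concepts.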