Concept Bottleneck Models

Authors: Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate concept bottleneck models on the two applications in Figure 1: the osteoarthritis grading task (Nevitt et al., 2006) and a fine-grained bird species identification task (Wah et al., 2011). On these tasks, we show that bottleneck models are competitive with standard end-to-end models while also attaining high concept accuracies.
Researcher Affiliation | Collaboration | 1Stanford University, 2Google Research.
Pseudocode | No | The paper describes the different bottleneck models and training schemes in prose but does not include any formal pseudocode blocks or algorithms.
Open Source Code | Yes | The code for replicating our experiments is available on GitHub at https://github.com/yewsiang/ConceptBottleneck.
Open Datasets | Yes | We use knee x-rays from the Osteoarthritis Initiative (OAI) (Nevitt et al., 2006)... We use the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011)... While we are unable to release the OAI dataset publicly, an application to access the data can be made at https://nda.nih.gov/oai/.
Dataset Splits | Yes | For the OAI task... We use a 70/10/20 train/validation/test split. The CUB dataset... We use the standard 60/40 train/test split, and randomly set aside 20% of the training images as the validation set. (From Appendix A)
Hardware Specification | No | The paper describes the models and datasets used but does not provide specific details about the hardware (e.g., GPU or CPU models) used for training or experimentation.
Software Dependencies | No | The paper mentions deep learning models (ResNet-18, Inception-v3) and optimizers (Adam) but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the joint bottleneck model, we search over the task-concept tradeoff hyperparameter λ and report results for the model that has the highest task accuracy while maintaining high concept accuracy on the validation set (λ = 1 for OAI and λ = 0.01 for CUB). We model x-ray grading as a regression problem (minimizing mean squared error)... we finetune a pretrained ResNet-18 model... and use a small 3-layer MLP for c → y. (From Appendix B: We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4, batch size of 32, and train for 50 epochs.)
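
To make the quoted configuration concrete, below is a minimal PyTorch sketch of the jointly trained concept bottleneck described in the Experiment Setup row: a pretrained backbone predicts concepts (x → c), a small 3-layer MLP maps predicted concepts to the task label (c → y), and the two losses are combined with the tradeoff weight λ. This is an illustration, not the authors' released code: NUM_CONCEPTS, NUM_CLASSES, and train_loader are placeholders, the backbone and loss choices differ between OAI (ResNet-18, MSE regression) and CUB (Inception-v3, classification), and the hyperparameters simply echo the values quoted above (Adam, learning rate 1e-4, batch size 32, 50 epochs).

```python
# Hedged sketch of a jointly trained concept bottleneck model.
# Assumptions: CUB-style setup (binary concepts, multi-class label);
# the placeholder names below are not taken from the authors' repository.
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 112   # number of concept annotations (placeholder)
NUM_CLASSES = 200    # number of task classes (placeholder)
LAMBDA = 0.01        # task-concept tradeoff reported for CUB

class ConceptBottleneck(nn.Module):
    def __init__(self, num_concepts, num_classes):
        super().__init__()
        # x -> c: finetune a pretrained backbone to predict concept logits
        self.backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_concepts)
        # c -> y: small 3-layer MLP from predicted concepts to the task label
        self.head = nn.Sequential(
            nn.Linear(num_concepts, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        c_logits = self.backbone(x)     # predicted concepts (the bottleneck)
        y_logits = self.head(c_logits)  # task prediction made from concepts only
        return c_logits, y_logits

model = ConceptBottleneck(NUM_CONCEPTS, NUM_CLASSES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
task_loss = nn.CrossEntropyLoss()       # task loss (MSE would be used for OAI)
concept_loss = nn.BCEWithLogitsLoss()   # per-concept loss for binary concepts

for epoch in range(50):                 # 50 epochs, as quoted from Appendix B
    for x, c, y in train_loader:        # placeholder DataLoader, batch size 32
        c_logits, y_logits = model(x)
        # joint objective: L_task + lambda * L_concept
        loss = task_loss(y_logits, y) + LAMBDA * concept_loss(c_logits, c.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The key design point the sketch illustrates is that the task head sees only the predicted concepts, so λ controls how strongly training pushes those intermediate predictions to match the annotated concepts.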