Concept Bottleneck Models
Authors: Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate concept bottleneck models on the two applications in Figure 1: the osteoarthritis grading task (Nevitt et al., 2006) and a fine-grained bird species identification task (Wah et al., 2011). On these tasks, we show that bottleneck models are competitive with standard end-to-end models while also attaining high concept accuracies. |
| Researcher Affiliation | Collaboration | Stanford University; Google Research. |
| Pseudocode | No | The paper describes the different bottleneck models and training schemes in prose but does not include any formal pseudocode blocks or algorithms. (An illustrative sketch of the three training schemes follows the table.) |
| Open Source Code | Yes | The code for replicating our experiments is available on GitHub at https://github.com/yewsiang/ConceptBottleneck. |
| Open Datasets | Yes | We use knee x-rays from the Osteoarthritis Initiative (OAI) (Nevitt et al., 2006)... We use the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011)... While we are unable to release the OAI dataset publicly, an application to access the data can be made at https://nda.nih.gov/oai/. |
| Dataset Splits | Yes | For the OAI task... We use a 70/10/20 train/validation/test split. The CUB dataset... We use the standard 60/40 train/test split, and randomly set aside 20% of the training images as the validation set. (From Appendix A; a split sketch follows the table.) |
| Hardware Specification | No | The paper describes the software models and datasets used but does not provide specific details about the hardware (e.g., GPU, CPU models) used for training or experimentation. |
| Software Dependencies | No | The paper mentions deep learning models (ResNet-18, Inception-v3) and optimizers (Adam) but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the joint bottleneck model, we search over the task-concept tradeoff hyperparameter λ and report results for the model that has the highest task accuracy while maintaining high concept accuracy on the validation set (λ = 1 for OAI and λ = 0.01 for CUB). We model x-ray grading as a regression problem (minimizing mean squared error)... we finetune a pretrained ResNet-18 model... and use a small 3-layer MLP for c → y. (From Appendix B: We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4, batch size of 32, and train for 50 epochs. See the architecture sketch below.) |
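
Since the paper describes its training schemes only in prose, here is a minimal PyTorch sketch of the three bottleneck variants it discusses (independent, sequential, and joint). The `fit` helper, the full-batch loop, and the choice of binary cross-entropy for concepts are illustrative assumptions rather than the authors' implementation; `g` maps inputs to concept logits and `f` maps concepts to task logits.

```python
import torch
import torch.nn as nn

def fit(model, inputs, targets, loss_fn, epochs, lr):
    """Illustrative full-batch trainer; the paper trains with minibatches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(inputs), targets).backward()
        opt.step()

def train_independent(g, f, x, c, y, epochs=50, lr=1e-4):
    # x -> c and c -> y are trained separately; f sees *true* concepts.
    fit(g, x, c, nn.BCEWithLogitsLoss(), epochs, lr)
    fit(f, c, y, nn.CrossEntropyLoss(), epochs, lr)

def train_sequential(g, f, x, c, y, epochs=50, lr=1e-4):
    # Train x -> c first, then train c -> y on g's *predicted* concepts.
    fit(g, x, c, nn.BCEWithLogitsLoss(), epochs, lr)
    with torch.no_grad():
        c_hat = torch.sigmoid(g(x))
    fit(f, c_hat, y, nn.CrossEntropyLoss(), epochs, lr)

def train_joint(g, f, x, c, y, lam=0.01, epochs=50, lr=1e-4):
    # Single objective, end to end: task loss + lam * concept loss.
    bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
    opt = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        c_logits = g(x)
        loss = ce(f(torch.sigmoid(c_logits)), y) + lam * bce(c_logits, c)
        loss.backward()
        opt.step()
```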
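
The splits row quotes a concrete protocol for CUB (hold out 20% of the standard training images for validation), which the sketch below reproduces; the random seed is an illustrative assumption, since the paper does not state one.

```python
import random

def split_train_val(train_ids, val_frac=0.2, seed=0):
    """Randomly set aside a fraction of training images as validation,
    as quoted above for CUB. The seed is illustrative, not from the paper."""
    ids = list(train_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(round(len(ids) * val_frac))
    return ids[n_val:], ids[:n_val]  # (train, validation)
```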
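
Finally, the experiment-setup row can be read as the following architecture sketch for the OAI task: a pretrained ResNet-18 finetuned to emit concept predictions, plus a small 3-layer MLP for c → y, with the x-ray grade treated as a regression target. The hidden width and the torchvision weights API (≥ 0.13) are assumptions; the quoted passage does not specify layer sizes.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_oai_bottleneck(n_concepts: int, hidden: int = 64):
    """Hedged sketch: ResNet-18 backbone (x -> c) + 3-layer MLP (c -> y).
    The hidden width is an illustrative assumption."""
    backbone = resnet18(weights="IMAGENET1K_V1")  # pretrained, then finetuned
    backbone.fc = nn.Linear(backbone.fc.in_features, n_concepts)  # features -> c
    head = nn.Sequential(  # c -> y: small 3-layer MLP, grade as regression
        nn.Linear(n_concepts, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )
    return backbone, head
```

Per the Appendix B quote, training would then use Adam with a learning rate of 1e-4, batch size 32, and 50 epochs, with the joint tradeoff λ selected on the validation set (λ = 1 for OAI, λ = 0.01 for CUB).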