Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution
Authors: Ying Wang, Tim G. J. Rudner, Andrew G. Wilson
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive empirical evaluation and demonstrate, on several datasets including healthcare data that can be used in safety-critical settings, that multi-modal information bottleneck attribution significantly outperforms existing gradient-based, perturbation-based, and attention-based attribution methods, quantitatively and qualitatively. |
| Researcher Affiliation | Academia | Ying Wang (New York University); Tim G. J. Rudner (New York University); Andrew Gordon Wilson (New York University) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments is available at: https://github.com/YingWANGG/M2IB. |
| Open Datasets | Yes | Our main datasets are (i) Conceptual Captions [26] consisting of diverse images and captions from the web, and (ii) MS-CXR (Local Alignment Chest X-ray dataset; [3]), which contains chest X-rays and texts describing radiological findings, complementing MIMIC-CXR (MIMIC Chest X-ray; [13]) by improving the bounding boxes and captions. |
| Dataset Splits | Yes | We split the testing dataset into 80% training data and 20% validation data. |
| Hardware Specification | Yes | In contrast, M2IB attribution only requires 100 forward passes and takes 1.2s for one image-text pair (7.8s on an RTX8000 with a batch size of 256). |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and refers to a 'pytorch-gradcam repository', but does not provide specific version numbers for these or other key software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | For each {image, caption} pair, we insert an information bottleneck into the given layer of the text encoder and image encoder of CLIP separately, then train the bottleneck using the same setup as the Per-Sample Bottleneck of the original IBA [24], which duplicates a single sample 10 times to stabilize training and runs 10 iterations using the Adam optimizer with a learning rate of 1. Experiments show no significant difference across different learning rates or additional training steps. We conduct hyper-parameter tuning on the layer index l, the scaling factor β, and the variance σ². (A simplified sketch of this setup follows the table.) |
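
The Experiment Setup row describes the Per-Sample Bottleneck training recipe: duplicate one sample 10 times, run 10 Adam iterations at learning rate 1, and tune the layer index l, the scaling factor β, and the variance σ². Below is a minimal PyTorch sketch of that recipe under simplifying assumptions: `PerSampleBottleneck`, `fit_bottleneck`, `score_fn`, and the toy tensors are hypothetical names introduced here, the compression term is a simplified stand-in for the KL term used in IBA/M2IB, and the CLIP encoders and layer-insertion mechanics are omitted. The authors' actual implementation is in the linked repository.

```python
# Minimal, self-contained sketch of the per-sample bottleneck training recipe
# described in the Experiment Setup row (10 copies of one sample, 10 Adam
# steps, learning rate 1). `features` stands in for the hidden activations of
# the chosen layer l, and `score_fn` for the image-text similarity that the
# bottleneck should preserve. This is an illustration, not the authors' code.

import torch
import torch.nn as nn


class PerSampleBottleneck(nn.Module):
    """Learns a per-element mask lambda in [0, 1] that mixes the original
    activation with Gaussian noise: z = lambda * x + (1 - lambda) * eps."""

    def __init__(self, feature_shape, sigma=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full(feature_shape, 5.0))  # mask logits
        self.sigma = sigma

    def forward(self, x):
        lam = torch.sigmoid(self.alpha)          # mask in [0, 1]
        eps = self.sigma * torch.randn_like(x)   # injected noise
        z = lam * x + (1.0 - lam) * eps
        # Simplified compression penalty: encourages lam -> 0 (drop information).
        # The actual method uses a variational KL estimate here.
        compression = lam.mean()
        return z, compression


def fit_bottleneck(features, score_fn, beta=0.1, steps=10, copies=10, lr=1.0):
    """Fit a bottleneck for one sample, duplicating it `copies` times to
    stabilize the noisy objective, as in the Per-Sample Bottleneck of IBA."""
    bottleneck = PerSampleBottleneck(features.shape, sigma=1.0)
    optimizer = torch.optim.Adam(bottleneck.parameters(), lr=lr)
    batch = features.detach().unsqueeze(0).repeat(copies, *[1] * features.dim())

    for _ in range(steps):
        optimizer.zero_grad()
        z, compression = bottleneck(batch)
        # Trade off preserving the (image, text) score against compression.
        loss = -score_fn(z).mean() + beta * compression
        loss.backward()
        optimizer.step()

    return torch.sigmoid(bottleneck.alpha).detach()  # attribution map


if __name__ == "__main__":
    # Toy stand-ins: a random "activation" map and a fixed projection as score.
    feats = torch.randn(16, 64)
    proj = torch.randn(64)
    heatmap = fit_bottleneck(feats, lambda z: z @ proj, beta=0.1)
    print(heatmap.shape)  # torch.Size([16, 64])
```

In the actual method, `score_fn` would be the CLIP image-text similarity computed from the bottlenecked activations, and a bottleneck of this kind would be inserted into layer l of both the image and text encoders.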