Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution

Authors: Ying Wang, Tim G. J. Rudner, Andrew G. Wilson

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive empirical evaluation and demonstrate, on several datasets including healthcare data that can be used in safety-critical settings, that multi-modal information bottleneck attribution significantly outperforms existing gradient-based, perturbation-based, and attention-based attribution methods, quantitatively and qualitatively.
Researcher Affiliation | Academia | Ying Wang (New York University), Tim G. J. Rudner (New York University), Andrew Gordon Wilson (New York University)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for our experiments is available at: https://github.com/YingWANGG/M2IB.
Open Datasets | Yes | Our main datasets are (i) Conceptual Captions [26], consisting of diverse images and captions from the web, and (ii) MS-CXR (Local Alignment Chest X-ray dataset; [3]), which contains chest X-rays and texts describing radiological findings, complementing MIMIC-CXR (MIMIC Chest X-ray; [13]) by improving the bounding boxes and captions.
Dataset Splits | Yes | We split the testing dataset into 80% training data and 20% validation data.
Hardware Specification | Yes | In contrast, M2IB attribution only requires 100 forward passes and takes 1.2s for one image-text pair; 7.8s on an RTX8000 with a batch size of 256.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and refers to a 'pytorch-gradcam repository', but does not provide specific version numbers for these or for other key software dependencies such as Python or PyTorch.
Experiment Setup | Yes | For each {image, caption} pair, we insert an information bottleneck into the given layer of the text encoder and image encoder of CLIP separately, then train the bottleneck using the same setup as the Per-Sample Bottleneck of the original IBA [24], which duplicates a single sample 10 times to stabilize training and runs 10 iterations using the Adam optimizer with a learning rate of 1. Experiments show no significant difference between different learning rates and more training steps. We conduct hyper-parameter tuning on the index of the layer l, the scaling factor β, and the variance σ².
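To make the quoted experiment setup concrete, below is a minimal, self-contained sketch of per-sample information-bottleneck attribution for a single image-text pair. It is not the authors' implementation (see the linked M2IB repository for that): the `PerSampleBottleneck` class, the `m2ib_attribution` function, the toy linear "heads" standing in for the remainder of the CLIP encoders, the standardised-activation assumption behind the analytic KL term, and the cosine-similarity fit term are all illustrative assumptions. Only the quoted hyperparameters (10 copies of the sample, 10 Adam iterations, learning rate 1, scaling factor β, noise variance σ²) are taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerSampleBottleneck(nn.Module):
    """Mixes a layer's (standardised) activations with Gaussian noise via a learned per-feature mask."""

    def __init__(self, num_features: int, sigma: float = 1.0):
        super().__init__()
        # alpha parametrises the mask lambda = sigmoid(alpha); initialised so the mask starts near 1.
        self.alpha = nn.Parameter(torch.full((num_features,), 5.0))
        self.sigma = sigma

    def forward(self, r):
        lam = torch.sigmoid(self.alpha)
        eps = torch.randn_like(r) * self.sigma
        z = lam * r + (1.0 - lam) * eps  # noisy bottleneck output
        # Analytic KL between N(lam*r, (1-lam)^2 * sigma^2) and the noise prior N(0, sigma^2),
        # assuming r has been standardised beforehand (an assumption of this sketch).
        kl = (-torch.log1p(-lam + 1e-6)
              + 0.5 * ((1.0 - lam) ** 2 + lam ** 2 * r ** 2 / self.sigma ** 2)
              - 0.5)
        return z, kl


def m2ib_attribution(image_feats, text_feats, image_head, text_head,
                     beta=0.1, sigma=1.0, steps=10, copies=10, lr=1.0):
    """Fit one bottleneck per modality for a single image-text pair and return per-feature KL maps."""
    for p in list(image_head.parameters()) + list(text_head.parameters()):
        p.requires_grad_(False)  # only the bottleneck masks are optimised

    # Duplicate the single sample to stabilise the noisy optimisation (10 copies in the quoted setup).
    img = image_feats.expand(copies, -1)
    txt = text_feats.expand(copies, -1)
    ib_img = PerSampleBottleneck(image_feats.shape[-1], sigma)
    ib_txt = PerSampleBottleneck(text_feats.shape[-1], sigma)
    opt = torch.optim.Adam(list(ib_img.parameters()) + list(ib_txt.parameters()), lr=lr)

    for _ in range(steps):
        z_img, kl_img = ib_img(img)
        z_txt, kl_txt = ib_txt(txt)
        e_img = F.normalize(image_head(z_img), dim=-1)
        e_txt = F.normalize(text_head(z_txt), dim=-1)
        fit = (e_img * e_txt).sum(dim=-1).mean()  # cross-modal alignment of the bottlenecked embeddings
        loss = -fit + beta * (kl_img.mean() + kl_txt.mean())
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The per-feature information (KL) of the converged bottlenecks serves as the attribution map.
    with torch.no_grad():
        return ib_img(image_feats[None])[1].squeeze(0), ib_txt(text_feats[None])[1].squeeze(0)


# Toy usage: random "layer activations" and linear projections standing in for the rest of CLIP.
if __name__ == "__main__":
    torch.manual_seed(0)
    img_feats, txt_feats = torch.randn(768), torch.randn(512)
    img_head, txt_head = nn.Linear(768, 256), nn.Linear(512, 256)
    img_attr, txt_attr = m2ib_attribution(img_feats, txt_feats, img_head, txt_head)
    print(img_attr.shape, txt_attr.shape)  # torch.Size([768]) torch.Size([512])
```

In the paper's setup the bottleneck is inserted at a chosen layer l inside CLIP's image and text encoders, so the resulting KL maps can be read out per image patch and per text token rather than per feature of a pooled embedding as in this toy example; the layer index l, β, and σ² are then tuned as described in the quoted setup.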