Gradient-based Visual Explanation for Transformer-based CLIP

Authors: Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, Antoni B. Chan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Qualitative and quantitative evaluations verify the superiority of Grad-ECLIP compared with the state-of-the-art methods." and "In this section we conduct experiments on Grad-ECLIP to: 1) evaluate its visual explanation qualitatively and quantitatively, and compare with the current SOTA methods; 2) evaluate the processing time; 3) gain insight about CLIP by analyzing the visual explanations."
Researcher Affiliation | Collaboration | Chenyang Zhao (1,2), Kun Wang (2), Xingyu Zeng (2), Rui Zhao (2), Antoni B. Chan (1); (1) Department of Computer Science, City University of Hong Kong, Hong Kong; (2) SenseTime Group Ltd.
Pseudocode | No | The paper describes its method using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm block is provided.
Open Source Code | Yes | "Codes are available here: https://github.com/Cyang-Zhao/Grad-Eclip."
Open Datasets | Yes | "We conducted the experiments with the ViT-B/16 architecture." ... MS COCO (Lin et al., 2014), ImageNet (Russakovsky et al., 2015), ImageNet-Segmentation (ImageNet-S) (Gao et al., 2022), CLEVR (Johnson et al., 2017), ImageNet-R (Hendrycks et al., 2021a), ImageNet-Sketch (Wang et al., 2019a), ImageNet-A (Hendrycks et al., 2021b), Conceptual Captions (CC) (Sharma et al., 2018), and chest X-ray with text (MS-CXR (Boecking et al., 2022)).
Dataset Splits | Yes | "The model performance is measured using top-1 or top-5 zero-shot classification accuracy on the validation set of ImageNet (Russakovsky et al., 2015) (ILSVRC 2012), consisting of 50K images from 1000 classes."
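The zero-shot protocol quoted above is standard for CLIP-style models: each image embedding is compared against the text embeddings of all class prompts, and accuracy is read off the similarity ranking. A minimal NumPy sketch of that computation (function and variable names are illustrative, not from the paper; real embeddings would come from CLIP's encoders):

```python
import numpy as np

def zero_shot_accuracy(image_embs, class_embs, labels, k=5):
    """Top-1 and top-k accuracy from L2-normalized image/class-text embeddings."""
    # Cosine similarity of every image against every class prompt embedding.
    sims = image_embs @ class_embs.T                     # (num_images, num_classes)
    ranked = np.argsort(-sims, axis=1)                   # classes, best match first
    top1 = float(np.mean(ranked[:, 0] == labels))
    topk = float(np.mean([labels[i] in ranked[i, :k] for i in range(len(labels))]))
    return top1, topk

# Toy example: 3 images, 4 classes, embedding dimension 8.
rng = np.random.default_rng(0)
class_embs = rng.normal(size=(4, 8))
class_embs /= np.linalg.norm(class_embs, axis=1, keepdims=True)
labels = np.array([0, 1, 2])
# Each toy "image" embedding is its class embedding plus a little noise.
image_embs = class_embs[labels] + 0.01 * rng.normal(size=(3, 8))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
top1, top2 = zero_shot_accuracy(image_embs, class_embs, labels, k=2)
```

With near-duplicate embeddings as above, both scores come out at 1.0; on real ImageNet validation data the same routine yields the top-1/top-5 figures the paper reports.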
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models) used for running experiments.
Software Dependencies | No | The paper mentions PyTorch but does not specify its version number or any other software dependencies with their versions.
Experiment Setup | Yes | "We conducted the experiments with the ViT-B/16 architecture." and "In the experiments, we use the last layer to explain the image encoder, and the last eight layers for interpreting the text encoder. The ablation study for the influence of different number of layers involved in image and text explanation is shown in Appendix." and "The explanation faithfulness has the trend that it first increases with more layers used and then goes down with the lower-layer features involved (N > 8). Therefore, we aggregate the last eight layers maps for interpreting the text encoder in our experiments."
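The setup quoted above aggregates per-layer explanation maps over the last eight transformer layers of the text encoder. The paper does not spell out the aggregation rule in this excerpt; the sketch below assumes a simple mean over the selected layers, with hypothetical names:

```python
import numpy as np

def aggregate_layer_maps(layer_maps, num_layers=8):
    """Combine explanation maps from the last `num_layers` transformer layers.

    layer_maps : list of per-layer relevance arrays (one per layer, ordered
        from the first to the last layer), e.g. one score per text token.
    The mean-aggregation here is an assumption for illustration; the paper
    only states that the last eight layers' maps are aggregated.
    """
    selected = layer_maps[-num_layers:]          # keep only the last N layers
    return np.mean(np.stack(selected, axis=0), axis=0)

# Toy example: a 12-layer encoder producing one relevance score per token.
maps = [np.full(5, layer_idx, dtype=float) for layer_idx in range(12)]
agg = aggregate_layer_maps(maps, num_layers=8)   # averages layers 4..11
```

Setting `num_layers=1` recovers the image-encoder configuration (last layer only), matching the N > 8 faithfulness trade-off the quoted ablation describes.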