Gradient-based Visual Explanation for Transformer-based CLIP
Authors: Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, Antoni B. Chan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Qualitative and quantitative evaluations verify the superiority of Grad-ECLIP compared with the state-of-the-art methods." and "In this section we conduct experiments on Grad-ECLIP to: 1) evaluate its visual explanation qualitatively and quantitatively, and compare with the current SOTA methods; 2) evaluate the processing time; 3) gain insight about CLIP by analyzing the visual explanations." |
| Researcher Affiliation | Collaboration | Chenyang Zhao (1,2), Kun Wang (2), Xingyu Zeng (2), Rui Zhao (2), Antoni B. Chan (1); 1: Department of Computer Science, City University of Hong Kong, Hong Kong; 2: SenseTime Group Ltd. |
| Pseudocode | No | The paper describes its method using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm block is provided. |
| Open Source Code | Yes | Codes are available here: https://github.com/Cyang-Zhao/Grad-Eclip. |
| Open Datasets | Yes | We conducted the experiments with the ViT-B/16 architecture. ... MS COCO (Lin et al., 2014), ImageNet (Russakovsky et al., 2015), ImageNet-Segmentation (ImageNet-S) (Gao et al., 2022), CLEVR (Johnson et al., 2017), ImageNet-R (Hendrycks et al., 2021a), ImageNet-Sketch (Wang et al., 2019a), ImageNet-A (Hendrycks et al., 2021b), Conceptual Captions (CC) (Sharma et al., 2018), and chest x-ray with text (MS-CXR (Boecking et al., 2022)). |
| Dataset Splits | Yes | The model performance is measured using top-1 or top-5 zero-shot classification accuracy on the validation set of ImageNet (Russakovsky et al., 2015) (ILSVRC 2012), consisting of 50K images from 1000 classes. (See the zero-shot evaluation sketch after the table.) |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models) used for running experiments. |
| Software Dependencies | No | The paper mentions PyTorch but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | "We conducted the experiments with the ViT-B/16 architecture." and "In the experiments, we use the last layer to explain the image encoder, and the last eight layers for interpreting the text encoder. The ablation study for the influence of different number of layers involved in image and text explanation is shown in Appendix." and "The explanation faithfulness has the trend that it first increases with more layers used and then goes down with the lower-layer features involved (N > 8). Therefore, we aggregate the last eight layers maps for interpreting the text encoder in our experiments." (See the layer-aggregation sketch after the table.) |
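For concreteness, the Dataset Splits row refers to the standard CLIP zero-shot protocol: top-1 accuracy on the ImageNet (ILSVRC 2012) validation set of 50K images across 1000 classes. Below is a minimal sketch of that protocol, assuming the OpenAI CLIP package (https://github.com/openai/CLIP) and a local torchvision ImageNet copy; the prompt template, root path, and batch size are illustrative assumptions, not taken from the paper.

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP
from torchvision.datasets import ImageNet

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Assumed local ImageNet val split; torchvision aligns labels with .classes.
dataset = ImageNet(root="path/to/imagenet", split="val", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=8)

# One text embedding per class from a simple (assumed) prompt template.
prompts = [f"a photo of a {names[0]}" for names in dataset.classes]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        img_feat = model.encode_image(images.to(device))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        # Nearest text embedding by cosine similarity gives the prediction.
        pred = (img_feat @ text_feat.T).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()
print(f"top-1 zero-shot accuracy: {correct / total:.4f}")
```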
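The Experiment Setup row quotes the paper's choice to aggregate explanation maps from the last eight text-encoder layers. The sketch below, again assuming the OpenAI CLIP package and ViT-B/16, shows one way to capture those layers' outputs with forward hooks and average a per-layer map across them. The per-layer score here is a stand-in (token feature norms), not the paper's gradient-based relevance map, which this sketch does not reproduce.

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/16", device="cpu")
N = 8  # the paper aggregates the last eight text-encoder layers
captured = []

def hook(module, inputs, output):
    # output: token features in LND layout, shape (seq_len, batch, width)
    captured.append(output.detach())

blocks = model.transformer.resblocks  # text-encoder residual blocks
handles = [blk.register_forward_hook(hook) for blk in blocks[-N:]]

tokens = clip.tokenize(["a photo of a dog"])
model.encode_text(tokens)
for h in handles:
    h.remove()

# Placeholder per-layer "maps"; the actual method computes a gradient-weighted
# relevance map per layer before averaging across the N layers.
maps = [feat.norm(dim=-1) for feat in captured]   # each (seq_len, batch)
aggregated = torch.stack(maps).mean(dim=0)        # average over the N layers
print(aggregated.shape)                           # torch.Size([77, 1])
```

Forward hooks are a convenient way to read intermediate block outputs without modifying the model; the quoted ablation (faithfulness rising with more layers, then degrading for N > 8) is what motivates fixing N = 8 in the paper's text-encoder explanations.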