Gradient-based Visual Explanation for Transformer-based CLIP

Authors: Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, Antoni B. Chan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Qualitative and quantitative evaluations verify the superiority of Grad-ECLIP compared with the state-of-the-art methods." and "In this section we conduct experiments on Grad-ECLIP to: 1) evaluate its visual explanation qualitatively and quantitatively, and compare with the current SOTA methods; 2) evaluate the processing time; 3) gain insight about CLIP by analyzing the visual explanations."
Researcher Affiliation | Collaboration | Chenyang Zhao (1,2), Kun Wang (2), Xingyu Zeng (2), Rui Zhao (2), Antoni B. Chan (1); (1) Department of Computer Science, City University of Hong Kong, Hong Kong; (2) SenseTime Group Ltd.
Pseudocode | No | The paper describes its method using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm block is provided.
Open Source Code | Yes | "Codes are available here: https://github.com/Cyang-Zhao/Grad-Eclip."
Open Datasets | Yes | "We conducted the experiments with the ViT-B/16 architecture." ... MS COCO (Lin et al., 2014), ImageNet (Russakovsky et al., 2015), ImageNet-Segmentation (ImageNet-S) (Gao et al., 2022), CLEVR (Johnson et al., 2017), ImageNet-R (Hendrycks et al., 2021a), ImageNet-Sketch (Wang et al., 2019a), ImageNet-A (Hendrycks et al., 2021b), Conceptual Captions (CC) (Sharma et al., 2018), and chest X-ray with text (MS-CXR (Boecking et al., 2022)).
Dataset Splits | Yes | "The model performance is measured using top-1 or top-5 zero-shot classification accuracy on the validation set of ImageNet (Russakovsky et al., 2015) (ILSVRC 2012), consisting of 50K images from 1000 classes."
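The zero-shot protocol quoted above is standard for CLIP-style models: each image embedding is compared against the text embeddings of all class prompts, and accuracy is read off the similarity ranking. A minimal NumPy sketch of that computation (function and variable names are illustrative, not from the paper; real embeddings would come from CLIP's encoders):

```python
import numpy as np

def zero_shot_accuracy(image_embs, class_embs, labels, k=5):
    """Top-1 and top-k accuracy from L2-normalized image/class-text embeddings."""
    # Cosine similarity of every image against every class prompt embedding.
    sims = image_embs @ class_embs.T                     # (num_images, num_classes)
    ranked = np.argsort(-sims, axis=1)                   # classes, best match first
    top1 = float(np.mean(ranked[:, 0] == labels))
    topk = float(np.mean([labels[i] in ranked[i, :k] for i in range(len(labels))]))
    return top1, topk

# Toy example: 3 images, 4 classes, embedding dimension 8.
rng = np.random.default_rng(0)
class_embs = rng.normal(size=(4, 8))
class_embs /= np.linalg.norm(class_embs, axis=1, keepdims=True)
labels = np.array([0, 1, 2])
# Each toy "image" embedding is its class embedding plus a little noise.
image_embs = class_embs[labels] + 0.01 * rng.normal(size=(3, 8))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
top1, top2 = zero_shot_accuracy(image_embs, class_embs, labels, k=2)
```

With near-duplicate embeddings as above, both scores come out at 1.0; on real ImageNet validation data the same routine yields the top-1/top-5 figures the paper reports.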
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models) used for running experiments.
Software Dependencies | No | The paper mentions PyTorch but does not specify its version number or any other software dependencies with their versions.
Experiment Setup | Yes | "We conducted the experiments with the ViT-B/16 architecture." and "In the experiments, we use the last layer to explain the image encoder, and the last eight layers for interpreting the text encoder. The ablation study for the influence of different number of layers involved in image and text explanation is shown in Appendix." and "The explanation faithfulness has the trend that it first increases with more layers used and then goes down with the lower-layer features involved (N > 8). Therefore, we aggregate the last eight layers maps for interpreting the text encoder in our experiments."
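The setup quoted above aggregates per-layer explanation maps over the last eight transformer layers of the text encoder. The paper does not spell out the aggregation rule in this excerpt; the sketch below assumes a simple mean over the selected layers, with hypothetical names:

```python
import numpy as np

def aggregate_layer_maps(layer_maps, num_layers=8):
    """Combine explanation maps from the last `num_layers` transformer layers.

    layer_maps : list of per-layer relevance arrays (one per layer, ordered
        from the first to the last layer), e.g. one score per text token.
    The mean-aggregation here is an assumption for illustration; the paper
    only states that the last eight layers' maps are aggregated.
    """
    selected = layer_maps[-num_layers:]          # keep only the last N layers
    return np.mean(np.stack(selected, axis=0), axis=0)

# Toy example: a 12-layer encoder producing one relevance score per token.
maps = [np.full(5, layer_idx, dtype=float) for layer_idx in range(12)]
agg = aggregate_layer_maps(maps, num_layers=8)   # averages layers 4..11
```

Setting `num_layers=1` recovers the image-encoder configuration (last layer only), matching the N > 8 faithfulness trade-off the quoted ablation describes.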