Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Authors: Saebom Leem, Hyunseok Seo

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and presents great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test. In this section, we present the results of the performance comparison of our method with previous leading methods.
Researcher Affiliation | Academia | Saebom Leem (1,2), Hyunseok Seo (1*); 1 Korea Institute of Science and Technology, 2 Sogang University; toqha1215@sogang.ac.kr, seo@kist.kr
Pseudocode | No | The paper describes its methodology using text and mathematical equations (e.g., Eq. 1-8) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or a direct link to, open-source code for the methodology described.
Open Datasets | Yes | For the evaluation, we used the validation set of ImageNet ILSVRC 2012 (Russakovsky et al. 2015) and Pascal VOC 2012 (Everingham et al. 2012) and the test set of Caltech-UCSD Birds-200-2011 (CUB 200) (Wah et al. 2011), which provide the bounding-box annotation label.
Dataset Splits | Yes | The result of the weakly-supervised object detection on the ImageNet ILSVRC 2012 validation set is presented in Table 1. The localization performance on the Pascal VOC 2012 validation set is presented in Table 2.
Hardware Specification | No | The paper mentions evaluating methods with a 'ViT-base model' but does not specify any hardware details such as the GPU or CPU models used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers, such as programming language versions or library versions (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | All methods are evaluated with the same ViT-base (Dosovitskiy et al. 2020) model that takes the input image with a size of [224 × 224 × 3]. All methods share the same model parameters, and the fine-tuning details of the model parameters are provided in the supplementary material. In this ViT, the input images are converted into a [14 × 14] grid of patches, and therefore each method generates a heatmap with a size of [14 × 14 × 1] where one pixel corresponds to the contribution of one image patch of the input image.
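To make the patch-grid arithmetic in the experiment-setup row concrete, the sketch below illustrates how a ViT-base patch size of 16 turns a [224 × 224 × 3] input into a [14 × 14] grid, and how a [14 × 14 × 1] per-patch heatmap could be upsampled back to the input resolution for the weakly-supervised localization evaluation. This is an illustration under assumptions, not the authors' released code (no code is released); the use of PyTorch and bilinear upsampling is an assumption, since the paper does not name its software stack.

```python
# Minimal sketch (assumptions: PyTorch, bilinear upsampling); not the authors' code.
import torch
import torch.nn.functional as F

IMG_SIZE = 224                 # input image is [224 x 224 x 3]
PATCH_SIZE = 16                # ViT-base uses 16 x 16 pixel patches
GRID = IMG_SIZE // PATCH_SIZE  # -> 14 patches per side

# A placeholder per-patch attribution map with one value per image patch,
# i.e. shape [14 x 14 x 1] as described in the experiment setup.
heatmap = torch.rand(GRID, GRID, 1)

# Upsample the patch-level heatmap to the input resolution so it can be
# compared against bounding-box annotations in the localization test.
heatmap_full = F.interpolate(
    heatmap.permute(2, 0, 1).unsqueeze(0),  # [1, 1, 14, 14]
    size=(IMG_SIZE, IMG_SIZE),
    mode="bilinear",
    align_corners=False,
).squeeze()                                 # [224, 224]

print(GRID, heatmap_full.shape)  # 14 torch.Size([224, 224])
```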