Attention-Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention
Authors: Saebom Leem, Hyunseok Seo
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and presents great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test. In this section, we present the results of the performance comparison of our method with previous leading methods. |
| Researcher Affiliation | Academia | Saebom Leem (1,2), Hyunseok Seo (1,*); 1: Korea Institute of Science and Technology, 2: Sogang University. Emails: toqha1215@sogang.ac.kr, seo@kist.kr |
| Pseudocode | No | The paper describes its methodology using text and mathematical equations (e.g., Eq. 1-8) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit code-release statement or a direct link to source code for the described methodology. |
| Open Datasets | Yes | For the evaluation, we used the validation set of ImageNet ILSVRC 2012 (Russakovsky et al. 2015) and Pascal VOC 2012 (Everingham et al. 2012) and the test set of Caltech-UCSD Birds-200-2011 (CUB 200) (Wah et al. 2011), which provide the bounding-box annotation label. |
| Dataset Splits | Yes | The result of the weakly-supervised object detection on the ImageNet ILSVRC 2012 validation set is presented in Table 1. The localization performance on the Pascal VOC 2012 validation set is presented in Table 2. |
| Hardware Specification | No | The paper mentions evaluating methods with a 'ViT-base model' but does not specify any hardware details like GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers, such as programming language versions or library versions (e.g., PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | All methods are evaluated with the same ViT-base (Dosovitskiy et al. 2020) model, which takes an input image of size [224 × 224 × 3]. All methods share the same model parameters, and the fine-tuning details of the model parameters are provided in the supplementary material. In this ViT, the input image is converted into a [14 × 14] grid of patches, so each method generates a heatmap of size [14 × 14 × 1] in which one pixel corresponds to the contribution of one image patch of the input image. (A minimal code sketch of this patch-to-heatmap geometry follows the table.) |
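
To make the patch-to-heatmap geometry from the setup row concrete, the sketch below reconstructs it in PyTorch. It assumes the standard ViT-base patch size of 16 × 16 pixels (224 / 16 = 14 patches per side), which the quoted passage implies but does not state, and it uses a random tensor as a stand-in for a real explanation heatmap. The upsampling back to input resolution is a common visualization convention, not a step taken from the paper.

```python
import torch
import torch.nn.functional as F

# Geometry of the quoted setup: a 224x224x3 input image is split into
# 16x16-pixel patches, giving a 14x14 grid (224 / 16 = 14). PATCH_SIZE
# is an assumption based on the standard ViT-base configuration.
IMG_SIZE, PATCH_SIZE = 224, 16
GRID = IMG_SIZE // PATCH_SIZE  # 14 patches per side

# Stand-in for a real [14, 14] patch-contribution heatmap produced by
# an explainability method (one score per image patch).
heatmap = torch.rand(GRID, GRID)

# For visualization or localization, the patch-level map is typically
# upsampled to input resolution so each score covers its 16x16 region.
upsampled = F.interpolate(
    heatmap[None, None],          # add batch/channel dims -> [1, 1, 14, 14]
    size=(IMG_SIZE, IMG_SIZE),
    mode="bilinear",
    align_corners=False,
)[0, 0]                           # drop the extra dims -> [224, 224]

print(heatmap.shape, upsampled.shape)
# torch.Size([14, 14]) torch.Size([224, 224])
```

Bilinear interpolation smooths patch boundaries; nearest-neighbor upsampling would instead preserve the hard 16 × 16 blocks of the patch grid.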