Dissecting Query-Key Interaction in Vision Transformers
Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. |
| Researcher Affiliation | Academia | 1 University of Miami, 2 Harvard University, 3 Michigan State University, 4 University of Texas Health Science Center at Houston |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for this work is available at: https://github.com/schwartz-cnl/DissectingViT. |
| Open Datasets | Yes | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. For each mode, we show the top 8 images in the Imagenet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Dataset Splits | Yes | For each mode, we show the top 8 images in the Imagenet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Hardware Specification | No | Our experiments do not require compute resources beyond a personal computer with a GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In this study, the "attention score" is defined as the dot product of every query and key pair, which has the shape of the number of tokens by the number of tokens and is defined per attention head. The "attention map" is the softmax of each query's attention score reshaped into a 2D image, which is defined per attention head and token. |
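
Following the definition quoted in the Experiment Setup row, the sketch below shows how the per-head attention score matrix and a single query's attention map could be computed in PyTorch. The function name, the CLS-token handling, and the tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def attention_score_and_map(q, k, query_idx, grid_size):
    """Compute a head's attention score matrix and one query's attention map.

    q, k: (num_tokens, head_dim) queries and keys for a single attention head.
    query_idx: index of the query token whose attention map is visualized.
    grid_size: side length of the patch grid (e.g. 14 for a 224x224 ViT-B/16).
    """
    # "Attention score": dot product of every query-key pair,
    # shape (num_tokens, num_tokens), defined per attention head.
    # (Scaling by 1/sqrt(head_dim) is omitted to match the quoted definition.)
    score = q @ k.transpose(-1, -2)

    # "Attention map": softmax of the chosen query's attention scores,
    # reshaped into a 2D image over the patch grid.
    attn = F.softmax(score[query_idx], dim=-1)
    attn_map = attn[1:].reshape(grid_size, grid_size)  # drop the CLS token (assumption)
    return score, attn_map


# Hypothetical usage with random tensors standing in for one ViT-B/16 head:
q = torch.randn(197, 64)  # 196 patch tokens + 1 CLS token, head_dim = 64
k = torch.randn(197, 64)
score, attn_map = attention_score_and_map(q, k, query_idx=0, grid_size=14)
print(score.shape, attn_map.shape)  # torch.Size([197, 197]) torch.Size([14, 14])
```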