Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Dissecting Query-Key Interaction in Vision Transformers
Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. |
| Researcher Affiliation | Academia | 1University of Miami 2Harvard University 3Michigan State University 4University of Texas Health Science Center at Houston |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for this work is available at: https://github.com/schwartz-cnl/DissectingViT. |
| Open Datasets | Yes | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. For each mode, we show the top 8 images in the Imagenet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Dataset Splits | Yes | For each mode, we show the top 8 images in the Imagenet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Hardware Specification | No | Our experiments do not require compute resources beyond a personal computer with a GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In this study, the "attention score" is defined as the dot product of every query and key pair, which has the shape of the number of tokens by the number of tokens and is defined per attention head. The "attention map" is the softmax of each query's attention score reshaped into a 2D image, which is defined per attention head and token. |
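The quoted definitions above can be illustrated with a minimal NumPy sketch. This is not the authors' code; the grid size, head dimension, and random `Q`/`K` matrices are illustrative assumptions, chosen only to show the shapes involved.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical shapes: one attention head, N = H*W image tokens, head dim d.
rng = np.random.default_rng(0)
H = W = 4            # illustrative token grid (ViTs often use e.g. 14x14)
N, d = H * W, 8
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

# "Attention score": dot product of every query-key pair,
# shape (num_tokens, num_tokens), defined per attention head.
scores = Q @ K.T

# "Attention map" for one query token: softmax of that query's score row,
# reshaped into the 2D token grid, shape (H, W).
query_idx = 0
attn_map = softmax(scores[query_idx]).reshape(H, W)

print(scores.shape)    # (16, 16)
print(attn_map.shape)  # (4, 4)
```

Note that each attention map sums to 1 by construction, since it is a softmax over one query's scores.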