Dissecting Query-Key Interaction in Vision Transformers

Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. |
| Researcher Affiliation | Academia | 1 University of Miami; 2 Harvard University; 3 Michigan State University; 4 University of Texas Health Science Center at Houston |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for this work is available at: https://github.com/schwartz-cnl/DissectingViT |
| Open Datasets | Yes | We utilized a dataset that has been applied to studying visual salience [24], namely the Odd-One-Out (O3) dataset [29]. For each mode, we show the top 8 images in the ImageNet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Dataset Splits | Yes | For each mode, we show the top 8 images in the ImageNet (Hugging Face version) [36] validation set that induce the largest attention score. |
| Hardware Specification | No | Our experiments do not require compute resources beyond a personal computer with a GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In this study, the "attention score" is defined as the dot product of every query and key pair, which has the shape of the number of tokens by the number of tokens and is defined per attention head. The "attention map" is the softmax of each query's attention score reshaped into a 2D image, which is defined per attention head and token. (A sketch of these definitions follows the table.) |
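
As a concrete illustration of the Experiment Setup definitions above, here is a minimal PyTorch sketch for a single attention head. The tensor names, the 197-token / 64-dimension shapes, and the handling of the CLS token before the 14 x 14 reshape are illustrative assumptions, not taken from the authors' code.

```python
import torch

# Illustrative shapes for a ViT-B/16-style model: 196 patch tokens plus
# 1 CLS token, 64 dimensions per attention head (assumed, not from the
# authors' code).
num_tokens, head_dim = 197, 64
q = torch.randn(num_tokens, head_dim)  # queries for one attention head
k = torch.randn(num_tokens, head_dim)  # keys for the same head

# "Attention score": the dot product of every query-key pair, giving a
# (num_tokens, num_tokens) matrix per attention head.
attention_score = q @ k.T

# "Attention map" for one query token: the softmax of that query's row
# of scores, reshaped into a 2D image. Dropping the CLS entry to recover
# a 14 x 14 patch grid is an assumption about the token layout.
query_idx = 0
attn = torch.softmax(attention_score[query_idx], dim=-1)
attention_map = attn[1:].reshape(14, 14)
```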