Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers

Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we seek to compare the performance of the transformer heads we have analyzed in this work to baseline convex optimization methods. This comparison allows us to illustrate the implicit biases imposed by these novel heads in a practical example. In particular, we consider the task of training a single new block of these convex heads for performing an image classification task. Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009).
Researcher Affiliation | Collaboration | Arda Sahiner (1), Tolga Ergen (1), Batu Ozturkler (1), John Pauly (1), Morteza Mardani (2), Mert Pilanci (1). (1) Department of Electrical Engineering, Stanford University, Stanford, CA, USA; (2) NVIDIA Corporation, Santa Clara, CA, USA.
Pseudocode | No | The paper contains mathematical derivations and descriptions of methods but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using third-party libraries such as the "Pytorch deep learning library" and the "Pytorch Image Models library", but does not state that the source code for the methodology described in this paper is openly available or provide a link.
Open Datasets | Yes | Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009). We first generate embeddings from a pretrained gMLP-S model (Liu et al., 2021) on 224 × 224 images from the ImageNet-1k dataset (Deng et al., 2009) with 16 × 16 patches (s = 196, d = 256).
Dataset Splits | No | The paper mentions using CIFAR-100 but does not specify the training, validation, and test split percentages or sample counts used for the experiments.
Hardware Specification | Yes | All heads were trained on two NVIDIA 1080 Ti GPUs using the Pytorch deep learning library (Paszke et al., 2019).
Software Dependencies | No | The paper mentions the "Pytorch deep learning library (Paszke et al., 2019)" and the "Pytorch Image Models library (Wightman, 2019)" but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For all experiments, we trained each head for 70 epochs, and used a regularization parameter of β = 2 × 10^-2, the AdamW optimizer (Loshchilov & Hutter, 2017), and a cosine learning rate schedule with a warmup of three epochs with warmup learning rate of 2 × 10^-7, an initial learning rate chosen based on training accuracy of either 5 × 10^-3 or 10^-4, and a final learning rate of 2 × 10^-2 times the initial learning rate. ... All heads aside from the self-attention head were trained using a batch size of 100, whereas the self-attention head was trained with a batch size of 20.
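
The Open Datasets row implies a two-stage pipeline: CIFAR-100 images are resized to 224 × 224, passed once through a frozen pretrained gMLP-S backbone, and the resulting s = 196 tokens of dimension d = 256 serve as inputs to the newly trained head. The sketch below shows one plausible way to produce those embeddings; the timm model name gmlp_s16_224, the use of forward_features as the extraction point, and the normalization constants are assumptions, since the quoted text only names the model and the patch/embedding sizes.

```python
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumption: "pretrained gMLP-S" corresponds to timm's `gmlp_s16_224`
# checkpoint (PyTorch Image Models, Wightman, 2019).
backbone = timm.create_model("gmlp_s16_224", pretrained=True)
backbone.eval().cuda()

# CIFAR-100 images are upsampled to 224 x 224 so that 16 x 16 patches yield
# s = 196 tokens of dimension d = 256, matching the quoted shapes.
# Assumption: 0.5/0.5 normalization, as commonly used for gMLP checkpoints.
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=preprocess)
loader = DataLoader(train_set, batch_size=100, shuffle=False, num_workers=4)

all_tokens, all_labels = [], []
with torch.no_grad():
    for images, targets in loader:
        # forward_features returns the pre-pooling token sequence,
        # of shape (batch, 196, 256) for gMLP-S at 224 x 224 input.
        tokens = backbone.forward_features(images.cuda())
        all_tokens.append(tokens.cpu())
        all_labels.append(targets)

torch.save({"tokens": torch.cat(all_tokens), "labels": torch.cat(all_labels)},
           "cifar100_gmlp_s_train_embeddings.pt")
```

Caching the embeddings once, as in the final torch.save call, keeps the per-head experiments cheap, since only the single new block has to be optimized afterwards.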
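
The Experiment Setup row fixes most optimization hyperparameters: 70 epochs, AdamW, β = 2 × 10^-2, a three-epoch warmup from 2 × 10^-7, an initial learning rate of 5 × 10^-3 or 10^-4, and a cosine decay to 2 × 10^-2 of the initial rate. Below is a minimal training-loop sketch wiring these numbers together; the LambdaLR-based schedule, the placeholder linear head, and the way β enters the loss (a plain squared-norm penalty here) are assumptions, since the quoted setup does not say how the schedule or the head-specific convex regularizers were implemented.

```python
import math
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

EPOCHS = 70
WARMUP_EPOCHS = 3
INIT_LR = 5e-3          # the quote reports 5e-3 or 1e-4, chosen by training accuracy
WARMUP_LR = 2e-7
FINAL_LR_FACTOR = 2e-2  # final learning rate = 2e-2 * initial learning rate
BETA = 2e-2             # regularization parameter from the quoted setup

# Placeholder data with the quoted shapes (s = 196 tokens, d = 256); in the paper
# these are the cached gMLP-S embeddings of CIFAR-100.
tokens = torch.randn(1000, 196, 256)
labels = torch.randint(0, 100, (1000,))
train_loader = DataLoader(TensorDataset(tokens, labels), batch_size=100, shuffle=True)

# Placeholder head: the paper trains a single new (convex) block here instead.
head = nn.Sequential(nn.Flatten(), nn.Linear(196 * 256, 100))
optimizer = torch.optim.AdamW(head.parameters(), lr=INIT_LR)

def lr_factor(epoch: int) -> float:
    """Linear warmup over 3 epochs, then cosine decay to FINAL_LR_FACTOR."""
    if epoch < WARMUP_EPOCHS:
        start = WARMUP_LR / INIT_LR
        return start + (1.0 - start) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, EPOCHS - WARMUP_EPOCHS)
    return FINAL_LR_FACTOR + (1.0 - FINAL_LR_FACTOR) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for x, y in train_loader:
        logits = head(x)
        # Assumption: beta enters as a simple squared-norm penalty; the paper's
        # convex programs attach head-specific regularizers instead.
        reg = sum(p.square().sum() for p in head.parameters())
        loss = criterion(logits, y) + BETA * reg
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

For the self-attention head, the quoted batch size of 20 would replace the batch_size=100 used here; everything else in the quoted setup is shared across heads.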
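
The Hardware Specification row reports two NVIDIA 1080 Ti GPUs but does not say how the two cards were used. The fragment below is a hypothetical data-parallel setup with torch.nn.DataParallel, which splits each batch (100 samples, or 20 for the self-attention head) across both devices; single-GPU runs per experiment would be equally consistent with the quoted text.

```python
import torch
from torch import nn

# Same placeholder head as in the training sketch above.
head = nn.Sequential(nn.Flatten(), nn.Linear(196 * 256, 100))

if torch.cuda.device_count() >= 2:
    # Splits each training batch across the two GPUs and gathers
    # the outputs back on cuda:0.
    head = nn.DataParallel(head, device_ids=[0, 1])
head = head.cuda()
```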