Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we seek to compare the performance of the transformer heads we have analyzed in this work to baseline convex optimization methods. This comparison allows us to illustrate the implicit biases imposed by these novel heads in a practical example. In particular, we consider the task of training a single new block of these convex heads for performing an image classification task. Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009). |
| Researcher Affiliation | Collaboration | Arda Sahiner¹, Tolga Ergen¹, Batu Ozturkler¹, John Pauly¹, Morteza Mardani², Mert Pilanci¹. ¹Department of Electrical Engineering, Stanford University, Stanford, CA, USA; ²NVIDIA Corporation, Santa Clara, CA, USA. |
| Pseudocode | No | The paper contains mathematical derivations and descriptions of methods but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using third-party libraries such as "Pytorch deep learning library" and "Pytorch Image Models library", but does not state that the source code for the methodology described in this paper is openly available or provide a link. |
| Open Datasets | Yes | Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009). We first generate embeddings from a pretrained gMLP-S model (Liu et al., 2021) on 224×224 images from the ImageNet-1k dataset (Deng et al., 2009) with 16×16 patches (s = 196, d = 256). (See the embedding-extraction sketch below the table.) |
| Dataset Splits | No | The paper mentions using CIFAR-100 but does not specify the training, validation, and test split percentages or sample counts used for the experiments. |
| Hardware Specification | Yes | All heads were trained on two NVIDIA 1080 Ti GPUs using the Pytorch deep learning library (Paszke et al., 2019). |
| Software Dependencies | No | The paper mentions "Pytorch deep learning library (Paszke et al., 2019)" and "Pytorch Image Models library (Wightman, 2019)" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all experiments, we trained each head for 70 epochs, and used a regularization parameter of β = 2×10⁻², the AdamW optimizer (Loshchilov & Hutter, 2017), and a cosine learning rate schedule with a warmup of three epochs with warmup learning rate of 2×10⁻⁷, an initial learning rate chosen based on training accuracy of either 5×10⁻³ or 10⁻⁴, and a final learning rate of 2×10⁻² times the initial learning rate. ...All heads aside from the self-attention head were trained using a batch size of 100, whereas the self-attention head was trained with a batch size of 20. (See the training-configuration sketch below the table.) |
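
The embedding-generation step quoted under "Open Datasets" produces patch-token embeddings (s = 196 tokens, d = 256 channels) from a gMLP-S backbone pretrained on ImageNet-1k, on which a single new head is then trained for CIFAR-100 classification. The sketch below illustrates that step under stated assumptions: the timm (PyTorch Image Models) model identifier `gmlp_s16_224`, its `forward_features` API, and standard ImageNet preprocessing; the exact checkpoint and preprocessing used by the authors are not given in the excerpt.

```python
# Sketch: extract (s=196, d=256) token embeddings for CIFAR-100 images
# from an ImageNet-1k-pretrained gMLP-S backbone.
# Assumptions (not stated in the paper excerpt): timm model id "gmlp_s16_224",
# the forward_features() API, and ImageNet normalization statistics.
import timm
import torch
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained gMLP-S backbone (16x16 patches on 224x224 inputs).
backbone = timm.create_model("gmlp_s16_224", pretrained=True).to(device).eval()

# Upsample 32x32 CIFAR-100 images to 224x224 and normalize with ImageNet stats.
transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=False, num_workers=4)

@torch.no_grad()
def extract_embeddings(loader):
    """Return patch-token embeddings of shape (N, 196, 256) and labels."""
    feats, labels = [], []
    for images, targets in loader:
        tokens = backbone.forward_features(images.to(device))  # (B, 196, 256)
        feats.append(tokens.cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

embeddings, labels = extract_embeddings(loader)  # the new head is trained on these
```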
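The "Experiment Setup" row quotes the training configuration: 70 epochs, AdamW, a cosine learning-rate schedule with a three-epoch warmup from 2×10⁻⁷, an initial learning rate of 5×10⁻³ or 10⁻⁴, a final learning rate of 2×10⁻² times the initial, and batch size 100 (20 for the self-attention head). The sketch below, continuing from the embedding-extraction sketch above, wires these values together. It assumes timm's `CosineLRScheduler` as the schedule implementation, applies the regularization β = 2×10⁻² as AdamW weight decay, and uses a plain linear layer as a placeholder for the convex heads studied in the paper; none of these implementation choices are confirmed by the excerpt.

```python
# Sketch of the quoted training configuration; `embeddings` and `labels`
# come from the extraction sketch above. The linear `head` is a placeholder
# standing in for the convex attention/MLP heads analyzed in the paper,
# and weight_decay=2e-2 is an assumed realization of the regularizer beta.
import torch
from timm.scheduler import CosineLRScheduler

init_lr = 5e-3   # or 1e-4, chosen by training accuracy per the excerpt
epochs = 70

emb_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(embeddings, labels),
    batch_size=100,  # 20 for the self-attention head
    shuffle=True,
)

head = torch.nn.Linear(196 * 256, 100)          # placeholder head over flattened tokens
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(head.parameters(), lr=init_lr, weight_decay=2e-2)
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    warmup_t=3,                 # three warmup epochs
    warmup_lr_init=2e-7,        # warmup learning rate
    lr_min=2e-2 * init_lr,      # final learning rate = 2e-2 x initial
)

for epoch in range(epochs):
    scheduler.step(epoch)       # per-epoch schedule: warmup, then cosine decay
    for tokens, targets in emb_loader:
        optimizer.zero_grad()
        logits = head(tokens.flatten(1))
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
```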