Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we seek to compare the performance of the transformer heads we have analyzed in this work to baseline convex optimization methods. This comparison allows us to illustrate the implicit biases imposed by these novel heads in a practical example. In particular, we consider the task of training a single new block of these convex heads for performing an image classification task. Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009). |
| Researcher Affiliation | Collaboration | Arda Sahiner¹, Tolga Ergen¹, Batu Ozturkler¹, John Pauly¹, Morteza Mardani², Mert Pilanci¹. ¹Department of Electrical Engineering, Stanford University, Stanford, CA, USA; ²NVIDIA Corporation, Santa Clara, CA, USA. |
| Pseudocode | No | The paper contains mathematical derivations and descriptions of methods but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using third-party libraries such as "Pytorch deep learning library" and "Pytorch Image Models library", but does not state that the source code for the methodology described in this paper is openly available or provide a link. |
| Open Datasets | Yes | Specifically, we seek to classify images from the CIFAR-100 dataset (Krizhevsky et al., 2009). We first generate embeddings from a pretrained gMLP-S model (Liu et al., 2021) on 224×224 images from the ImageNet-1k dataset (Deng et al., 2009) with 16×16 patches (s = 196, d = 256). (See the embedding-extraction sketch below the table.) |
| Dataset Splits | No | The paper mentions using CIFAR-100 but does not specify the training, validation, and test split percentages or sample counts used for the experiments. |
| Hardware Specification | Yes | All heads were trained on two NVIDIA 1080 Ti GPUs using the Pytorch deep learning library (Paszke et al., 2019). |
| Software Dependencies | No | The paper mentions "Pytorch deep learning library (Paszke et al., 2019)" and "Pytorch Image Models library (Wightman, 2019)" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all experiments, we trained each head for 70 epochs, and used a regularization parameter of β = 2×10⁻², the AdamW optimizer (Loshchilov & Hutter, 2017), and a cosine learning rate schedule with a warmup of three epochs with warmup learning rate of 2×10⁻⁷, an initial learning rate chosen based on training accuracy of either 5×10⁻³ or 10⁻⁴, and a final learning rate of 2×10⁻² times the initial learning rate. ...All heads aside from the self-attention head were trained using a batch size of 100, whereas the self-attention head was trained with a batch size of 20. (See the training-configuration sketch below the table.) |
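
The embedding-generation step quoted under "Open Datasets" produces patch-token embeddings (s = 196 tokens, d = 256 channels) from a gMLP-S backbone pretrained on ImageNet-1k, on which a single new head is then trained for CIFAR-100 classification. The sketch below illustrates that step under stated assumptions: the timm (PyTorch Image Models) model identifier `gmlp_s16_224`, its `forward_features` API, and standard ImageNet preprocessing; the exact checkpoint and preprocessing used by the authors are not given in the excerpt.

```python
# Sketch: extract (s=196, d=256) token embeddings for CIFAR-100 images
# from an ImageNet-1k-pretrained gMLP-S backbone.
# Assumptions (not stated in the paper excerpt): timm model id "gmlp_s16_224",
# the forward_features() API, and ImageNet normalization statistics.
import timm
import torch
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained gMLP-S backbone (16x16 patches on 224x224 inputs).
backbone = timm.create_model("gmlp_s16_224", pretrained=True).to(device).eval()

# Upsample 32x32 CIFAR-100 images to 224x224 and normalize with ImageNet stats.
transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=False, num_workers=4)

@torch.no_grad()
def extract_embeddings(loader):
    """Return patch-token embeddings of shape (N, 196, 256) and labels."""
    feats, labels = [], []
    for images, targets in loader:
        tokens = backbone.forward_features(images.to(device))  # (B, 196, 256)
        feats.append(tokens.cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

embeddings, labels = extract_embeddings(loader)  # the new head is trained on these
```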
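The "Experiment Setup" row quotes the training configuration: 70 epochs, AdamW, a cosine learning-rate schedule with a three-epoch warmup from 2×10⁻⁷, an initial learning rate of 5×10⁻³ or 10⁻⁴, a final learning rate of 2×10⁻² times the initial, and batch size 100 (20 for the self-attention head). The sketch below, continuing from the embedding-extraction sketch above, wires these values together. It assumes timm's `CosineLRScheduler` as the schedule implementation, applies the regularization β = 2×10⁻² as AdamW weight decay, and uses a plain linear layer as a placeholder for the convex heads studied in the paper; none of these implementation choices are confirmed by the excerpt.

```python
# Sketch of the quoted training configuration; `embeddings` and `labels`
# come from the extraction sketch above. The linear `head` is a placeholder
# standing in for the convex attention/MLP heads analyzed in the paper,
# and weight_decay=2e-2 is an assumed realization of the regularizer beta.
import torch
from timm.scheduler import CosineLRScheduler

init_lr = 5e-3   # or 1e-4, chosen by training accuracy per the excerpt
epochs = 70

emb_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(embeddings, labels),
    batch_size=100,  # 20 for the self-attention head
    shuffle=True,
)

head = torch.nn.Linear(196 * 256, 100)          # placeholder head over flattened tokens
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(head.parameters(), lr=init_lr, weight_decay=2e-2)
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    warmup_t=3,                 # three warmup epochs
    warmup_lr_init=2e-7,        # warmup learning rate
    lr_min=2e-2 * init_lr,      # final learning rate = 2e-2 x initial
)

for epoch in range(epochs):
    scheduler.step(epoch)       # per-epoch schedule: warmup, then cosine decay
    for tokens, targets in emb_loader:
        optimizer.zero_grad()
        logits = head(tokens.flatten(1))
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
```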