Max-Margin Token Selection in Attention Mechanism

Authors: Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we verify our theoretical findings via numerical experiments and provide insights." (Section 4: Experiments)
Researcher Affiliation | Academia | Davoud Ataee Tarzanagh, University of Pennsylvania (tarzanagh@upenn.edu); Yingcong Li and Xuechen Zhang, University of California, Riverside ({yli692,xzhan394}@ucr.edu); Samet Oymak, University of Michigan and UC Riverside (oymak@umich.edu)
Pseudocode | No | The paper describes algorithms and mathematical formulations but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "The code for experiments can be found at https://github.com/ucr-optml/max_margin_attention."
Open Datasets | Yes | "To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR-10 dataset [24] for 400 epochs with fixed learning rate 3×10^-3." [24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55(5), 2014.
Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not explicitly describe the training, validation, and test splits with specific percentages or sample counts.
Hardware Specification | No | The paper describes the experiments but does not specify any particular hardware used (e.g., GPU models, CPU types, or cloud compute instances).
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "During training, we use SGD optimizer with learning rate 0.1 and train the model for 1000 iterations." "To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with fixed learning rate 3×10^-3."
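
For context, the ViT-on-CIFAR-10 setting quoted above maps onto a standard PyTorch training loop. The sketch below is a minimal illustration under assumptions the paper does not state: torchvision's CIFAR10 loader and vit_b_16 model, a batch size of 128, resizing images to 224×224, and common CIFAR-10 normalization statistics. Only the SGD optimizer, the fixed 3e-3 learning rate, and the 400-epoch budget come from the quoted text; the separate attention experiment (SGD, learning rate 0.1, 1000 iterations) would follow the same pattern with a different model and data.

# Minimal sketch of the quoted ViT-base / CIFAR-10 setup.
# Assumptions (not reported in the paper): torchvision model and loader,
# batch size 128, 224x224 resizing, normalization statistics.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CIFAR-10 resized to ViT-base's 224x224 input resolution (assumption).
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

# ViT-base trained from scratch (no pretrained weights), 10 output classes.
model = models.vit_b_16(weights=None, num_classes=10).to(device)

# Quoted hyperparameters: SGD with a fixed learning rate of 3e-3, 400 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(400):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()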