Max-Margin Token Selection in Attention Mechanism

Authors: Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we verify our theoretical findings via numerical experiments and provide insights." (Section 4: Experiments)
Researcher Affiliation | Academia | Davoud Ataee Tarzanagh, University of Pennsylvania (tarzanagh@upenn.edu); Yingcong Li and Xuechen Zhang, University of California, Riverside ({yli692,xzhan394}@ucr.edu); Samet Oymak, University of Michigan and UC Riverside (oymak@umich.edu)
Pseudocode | No | The paper describes algorithms and mathematical formulations but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "The code for experiments can be found at https://github.com/ucr-optml/max_margin_attention."
Open Datasets | Yes | "To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR-10 dataset [24] for 400 epochs with fixed learning rate 3×10^-3." [24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55(5), 2014.
Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not explicitly describe the training, validation, and test splits with specific percentages or sample counts.
Hardware Specification | No | The paper describes the experiments but does not specify any particular hardware used (e.g., GPU models, CPU types, or cloud compute instances).
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "During training, we use SGD optimizer with learning rate 0.1 and train the model for 1000 iterations." "To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with fixed learning rate 3×10^-3."
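
For context, the ViT-on-CIFAR-10 setting quoted above maps onto a standard PyTorch training loop. The sketch below is a minimal illustration under assumptions the paper does not state: torchvision's CIFAR10 loader and vit_b_16 model, a batch size of 128, resizing images to 224×224, and common CIFAR-10 normalization statistics. Only the SGD optimizer, the fixed 3e-3 learning rate, and the 400-epoch budget come from the quoted text; the separate attention experiment (SGD, learning rate 0.1, 1000 iterations) would follow the same pattern with a different model and data.

# Minimal sketch of the quoted ViT-base / CIFAR-10 setup.
# Assumptions (not reported in the paper): torchvision model and loader,
# batch size 128, 224x224 resizing, normalization statistics.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CIFAR-10 resized to ViT-base's 224x224 input resolution (assumption).
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

# ViT-base trained from scratch (no pretrained weights), 10 output classes.
model = models.vit_b_16(weights=None, num_classes=10).to(device)

# Quoted hyperparameters: SGD with a fixed learning rate of 3e-3, 400 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(400):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()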