Max-Margin Token Selection in Attention Mechanism
Authors: Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we verify our theoretical findings via numerical experiments and provide insights." (Section 4, Experiments) |
| Researcher Affiliation | Academia | Davoud Ataee Tarzanagh, University of Pennsylvania (tarzanaq@upenn.edu); Yingcong Li and Xuechen Zhang, University of California, Riverside ({yli692,xzhan394}@ucr.edu); Samet Oymak, University of Michigan and UC Riverside (oymak@umich.edu) |
| Pseudocode | No | The paper describes algorithms and mathematical formulations but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code for experiments can be found at https://github.com/ucr-optml/max_margin_attention. |
| Open Datasets | Yes | To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with a fixed learning rate of 3 × 10⁻³. [24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55(5), 2014. |
| Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not explicitly describe the training, validation, and test splits with specific percentages or sample counts. |
| Hardware Specification | No | The paper describes the experiments but does not specify any particular hardware used (e.g., GPU models, CPU types, or cloud compute instances). |
| Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | During training, we use the SGD optimizer with learning rate 0.1 and train the model for 1000 iterations. To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with a fixed learning rate of 3 × 10⁻³. |
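
The ViT portion of the reported setup (ViT-base trained from scratch on CIFAR-10 for 400 epochs with SGD at a fixed learning rate of 3 × 10⁻³) can be summarized in a minimal PyTorch sketch. The specific model constructor (torchvision's `vit_b_16`), the 224×224 input resizing, and the batch size below are assumptions not stated in the paper; the authors' released code at https://github.com/ucr-optml/max_margin_attention is the authoritative implementation.

```python
# Minimal sketch of the reported CIFAR-10 / ViT setup, under stated assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumption: resize CIFAR-10 images to 224x224 to match torchvision's ViT-B/16.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)  # batch size assumed

model = models.vit_b_16(weights=None, num_classes=10).to(device)  # trained from scratch (no pretraining)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)          # fixed learning rate, no schedule
criterion = nn.CrossEntropyLoss()

for epoch in range(400):                                          # 400 epochs, as reported
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The synthetic experiments quoted in the same row (SGD with learning rate 0.1 for 1000 iterations) would follow the same pattern with the optimizer and iteration count swapped accordingly.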