Self-attention Networks Localize When QK-eigenspectrum Concentrates
Authors: Han Bao, Ryuichiro Hataya, Ryo Karakida
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we indirectly observed the correlation of the eigenspectrum and the model performance in the experiments with the WikiText dataset (Merity et al., 2016) by introducing a regularization scheme called LOCATER. |
| Researcher Affiliation | Academia | ¹Kyoto University, ²RIKEN AIP, ³AIST. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using 'fairseq v0.12.2' as a toolkit, but does not provide a specific link or explicit statement about making the source code for their own methodology available. |
| Open Datasets | Yes | The dataset we used is WikiText-2 (Merity et al., 2016), which is a collection of high-quality Wikipedia articles. |
| Dataset Splits | No | The paper includes figures showing results for 'Attn. entropy (val)' and 'Perplexity (val)', implying the use of a validation set, but does not provide specific details on the dataset split percentages or methodology for training, validation, and testing. |
| Hardware Specification | No | The paper states 'A part of the experiments of this research was conducted using Wisteria/Aquarius in the Information Technology Center, the University of Tokyo,' but does not provide specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | Yes | We used fairseq v0.12.2 (Ott et al., 2019), which is a toolkit oriented for sequence modeling, to implement and train transformers. |
| Experiment Setup | Yes | The model is a 1-layer transformer with a single-head self-attention and Post-LN (default), and the input embedding dimension, attention embedding dimension, and feed-forward net embedding dimension are all set to 128 (namely, d = 128). Input data were transformed into 64 tokens (namely, T = 64) with batch size 64. The optimizer is Adam (Kingma & Ba, 2015) with default parameters and no clip norm, and weight decay of 0.01 is used. The learning rate is fixed to 2.5 × 10⁻⁵ without any scheduling. |
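
To make the reported hyperparameters concrete, the following is a minimal, self-contained PyTorch sketch of a comparable setup. The authors trained with fairseq v0.12.2 rather than custom PyTorch code, so this is not their implementation; the vocabulary size, the random token batch, and the `OneLayerLM` class are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch (assumed, not the authors' fairseq setup): a 1-layer,
# single-head, Post-LN transformer LM with d = 128, T = 64, batch size 64,
# Adam with default betas, weight decay 0.01, fixed lr 2.5e-5, no clip norm.
import torch
import torch.nn as nn

VOCAB, D, T, BATCH = 32000, 128, 64, 64  # VOCAB is a placeholder value


class OneLayerLM(nn.Module):
    """1-layer, single-head transformer language model with Post-LN."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        # norm_first=False gives the Post-LN ordering used as the paper's default;
        # attention and feed-forward embedding dimensions are both 128.
        self.block = nn.TransformerEncoderLayer(
            d_model=D, nhead=1, dim_feedforward=D,
            norm_first=False, batch_first=True,
        )
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.block(self.embed(tokens), src_mask=causal)
        return self.head(h)


model = OneLayerLM()
# Adam with default parameters, weight decay 0.01, fixed learning rate
# 2.5e-5 (no scheduler), and no gradient clipping, as reported.
opt = torch.optim.Adam(model.parameters(), lr=2.5e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a WikiText-2 batch of T = 64 tokens per sequence.
tokens = torch.randint(0, VOCAB, (BATCH, T + 1))
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```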