Self-attention Networks Localize When QK-eigenspectrum Concentrates

Authors: Han Bao, Ryuichiro Hataya, Ryo Karakida

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Lastly, we indirectly observed the correlation of the eigenspectrum and the model performance in the experiments with the WikiText dataset (Merity et al., 2016) by introducing a regularization scheme called LOCATER.
Researcher Affiliation | Academia | Kyoto University; RIKEN AIP; AIST.
Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using 'fairseq v0.12.2' as a toolkit, but does not provide a specific link or explicit statement about making the source code for their own methodology available.
Open Datasets | Yes | The dataset we used is WikiText-2 (Merity et al., 2016), which is a collection of high-quality Wikipedia articles.
Dataset Splits | No | The paper includes figures showing results for 'Attn. entropy (val)' and 'Perplexity (val)', implying the use of a validation set, but does not provide specific details on the dataset split percentages or methodology for training, validation, and testing.
Hardware Specification | No | The paper states 'A part of the experiments of this research was conducted using Wisteria/Aquarius in the Information Technology Center, the University of Tokyo,' but does not provide specific hardware details such as GPU/CPU models or memory.
Software Dependencies | Yes | We used fairseq v0.12.2 (Ott et al., 2019), which is a toolkit oriented for sequence modeling, to implement and train transformers.
Experiment Setup | Yes | The model is a 1-layer transformer with a single-head self-attention and Post-LN (default), and the input embedding dimension, attention embedding dimension, and feed-forward net embedding dimension are all set to 128 (namely, d = 128). Input data were transformed into 64 tokens (namely, T = 64) with batch size 64. The optimizer is Adam (Kingma & Ba, 2015) with default parameters and no clip norm, and a weight decay of 0.01 is used. The learning rate is fixed to 2.5 × 10⁻⁵ without any scheduling.
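The 'Software Dependencies' and 'Experiment Setup' rows fix the toolkit and the hyperparameters but not the authors' actual configuration, which is not released. The sketch below reconstructs that reported setup in plain PyTorch and also computes rough proxies for the two quantities the table refers to, the QK eigenspectrum and the attention entropy. It is an illustration only: the vocabulary size, module choices, and measurement details are assumptions, not the authors' fairseq implementation.

```python
import torch
import torch.nn as nn

# Reported setup: 1-layer, single-head, Post-LN transformer; d = 128 for the
# input, attention, and feed-forward dimensions; T = 64 tokens; batch size 64.
VOCAB_SIZE = 33278          # assumption: WikiText-2 word-level vocabulary size
D_MODEL, SEQ_LEN, BATCH = 128, 64, 64

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
block = nn.TransformerEncoderLayer(
    d_model=D_MODEL,          # input / attention embedding dimension
    nhead=1,                  # single-head self-attention
    dim_feedforward=D_MODEL,  # feed-forward embedding dimension, also 128
    norm_first=False,         # Post-LN: LayerNorm after each residual addition
    batch_first=True,
)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

# Adam with default parameters, weight decay 0.01, fixed learning rate 2.5e-5,
# no gradient clipping, no learning-rate schedule.
params = [*embed.parameters(), *block.parameters(), *lm_head.parameters()]
optimizer = torch.optim.Adam(params, lr=2.5e-5, weight_decay=0.01)

# One forward pass with a causal mask, as in language modeling on 64-token inputs.
tokens = torch.randint(0, VOCAB_SIZE, (BATCH, SEQ_LEN))
causal_mask = nn.Transformer.generate_square_subsequent_mask(SEQ_LEN)
logits = lm_head(block(embed(tokens), src_mask=causal_mask))   # (64, 64, VOCAB_SIZE)

# Proxies for the quantities mentioned above: (1) the eigenspectrum of the
# query-key parameter product defining the attention logits' bilinear form,
# and (2) the attention entropy (low entropy <=> localized attention).
W_q, W_k, _ = block.self_attn.in_proj_weight.chunk(3, dim=0)       # (d, d) each
qk_spectrum = torch.linalg.eigvals((W_q.T @ W_k).detach())         # complex in general
print(qk_spectrum.real.mean(), qk_spectrum.real.var())             # concentration proxy

with torch.no_grad():
    x = embed(tokens)       # Post-LN feeds the raw sublayer input to attention
    _, attn_weights = block.self_attn(x, x, x, attn_mask=causal_mask,
                                      need_weights=True)           # (64, 64, 64)
entropy = -(attn_weights * attn_weights.clamp_min(1e-12).log()).sum(-1).mean()
print(entropy)
```

Because the reported setup uses the same value 128 for the input, attention, and feed-forward dimensions, a single D_MODEL constant is reused above; in an actual fairseq run the same choices would be expressed through the toolkit's model and optimizer command-line options rather than this code.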