Self-attention Networks Localize When QK-eigenspectrum Concentrates

Authors: Han Bao, Ryuichiro Hataya, Ryo Karakida

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Lastly, we indirectly observed the correlation of the eigenspectrum and the model performance in the experiments with the WikiText dataset (Merity et al., 2016) by introducing a regularization scheme called LOCATER.
Researcher Affiliation | Academia | Kyoto University; RIKEN AIP; AIST.
Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using 'fairseq v0.12.2' as a toolkit, but does not provide a specific link or explicit statement about making the source code for their own methodology available.
Open Datasets | Yes | The dataset we used is WikiText-2 (Merity et al., 2016), which is a collection of high-quality Wikipedia articles.
Dataset Splits | No | The paper includes figures showing results for 'Attn. entropy (val)' and 'Perplexity (val)', implying the use of a validation set, but does not provide specific details on the dataset split percentages or methodology for training, validation, and testing.
Hardware Specification | No | The paper states 'A part of the experiments of this research was conducted using Wisteria/Aquarius in the Information Technology Center, the University of Tokyo,' but does not provide specific hardware details such as GPU/CPU models or memory.
Software Dependencies | Yes | We used fairseq v0.12.2 (Ott et al., 2019), which is a toolkit oriented for sequence modeling, to implement and train transformers.
Experiment Setup | Yes | The model is a 1-layer transformer with a single-head self-attention and Post-LN (default), and the input embedding dimension, attention embedding dimension, and feed-forward net embedding dimension are all set to 128 (namely, d = 128). Input data were transformed into 64 tokens (namely, T = 64) with batch size 64. The optimizer is Adam (Kingma & Ba, 2015) with default parameters and no clip norm, and a weight decay of 0.01 is used. The learning rate is fixed to 2.5 × 10⁻⁵ without any scheduling.
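The 'Software Dependencies' and 'Experiment Setup' rows fix the toolkit and the hyperparameters but not the authors' actual configuration, which is not released. The sketch below reconstructs that reported setup in plain PyTorch and also computes rough proxies for the two quantities the table refers to, the QK eigenspectrum and the attention entropy. It is an illustration only: the vocabulary size, module choices, and measurement details are assumptions, not the authors' fairseq implementation.

```python
import torch
import torch.nn as nn

# Reported setup: 1-layer, single-head, Post-LN transformer; d = 128 for the
# input, attention, and feed-forward dimensions; T = 64 tokens; batch size 64.
VOCAB_SIZE = 33278          # assumption: WikiText-2 word-level vocabulary size
D_MODEL, SEQ_LEN, BATCH = 128, 64, 64

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
block = nn.TransformerEncoderLayer(
    d_model=D_MODEL,          # input / attention embedding dimension
    nhead=1,                  # single-head self-attention
    dim_feedforward=D_MODEL,  # feed-forward embedding dimension, also 128
    norm_first=False,         # Post-LN: LayerNorm after each residual addition
    batch_first=True,
)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

# Adam with default parameters, weight decay 0.01, fixed learning rate 2.5e-5,
# no gradient clipping, no learning-rate schedule.
params = [*embed.parameters(), *block.parameters(), *lm_head.parameters()]
optimizer = torch.optim.Adam(params, lr=2.5e-5, weight_decay=0.01)

# One forward pass with a causal mask, as in language modeling on 64-token inputs.
tokens = torch.randint(0, VOCAB_SIZE, (BATCH, SEQ_LEN))
causal_mask = nn.Transformer.generate_square_subsequent_mask(SEQ_LEN)
logits = lm_head(block(embed(tokens), src_mask=causal_mask))   # (64, 64, VOCAB_SIZE)

# Proxies for the quantities mentioned above: (1) the eigenspectrum of the
# query-key parameter product defining the attention logits' bilinear form,
# and (2) the attention entropy (low entropy <=> localized attention).
W_q, W_k, _ = block.self_attn.in_proj_weight.chunk(3, dim=0)       # (d, d) each
qk_spectrum = torch.linalg.eigvals((W_q.T @ W_k).detach())         # complex in general
print(qk_spectrum.real.mean(), qk_spectrum.real.var())             # concentration proxy

with torch.no_grad():
    x = embed(tokens)       # Post-LN feeds the raw sublayer input to attention
    _, attn_weights = block.self_attn(x, x, x, attn_mask=causal_mask,
                                      need_weights=True)           # (64, 64, 64)
entropy = -(attn_weights * attn_weights.clamp_min(1e-12).log()).sum(-1).mean()
print(entropy)
```

Because the reported setup uses the same value 128 for the input, attention, and feed-forward dimensions, a single D_MODEL constant is reused above; in an actual fairseq run the same choices would be expressed through the toolkit's model and optimizer command-line options rather than this code.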