Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
Authors: Rachel S.Y. Teo, Tan Nguyen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation task. The code is publicly available at https://github.com/rachtsy/KPCA_code. 1 Introduction ... 4 Experimental Results We aim to numerically show that: (i) RPC-Attention achieves competitive or even better accuracy than the baseline softmax attention on clean data, and (ii) the advantages of RPC-Attention are more prominent when there is a contamination of samples across different types of data and a variety of tasks. |
| Researcher Affiliation | Academia | Rachel S.Y. Teo Department of Mathematics National University of Singapore rachel.tsy@u.nus.edu Tan M. Nguyen Department of Mathematics National University of Singapore tanmn@nus.edu.sg |
| Pseudocode | Yes | Algorithm 1 Principal Attention Pursuit (PAP) ... Algorithm 2 ADMM for Principal Components Pursuit (a generic sketch of PCP via ADMM is given below the table) |
| Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/KPCA_code. |
| Open Datasets | Yes | We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... We assess our model on the large-scale WikiText-103 language modeling task [47]. ... providing results on the ADE20K image segmentation task [95]. |
| Dataset Splits | Yes | We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... The training set contains about 28,000 articles, with a total of 103 million words. ... The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words. |
| Hardware Specification | Yes | All of our results are averaged over 5 runs with different seeds and run on 4 A100 GPU. |
| Software Dependencies | No | The paper references specific code implementations from other researchers (e.g., 'https://github.com/facebookresearch/deit', 'https://github.com/rstrudel/segmenter', 'https://github.com/IDSIA/lmtool-fwp') but does not list specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Our RPC-SymViT models have 5.2M parameters, the same as the SymViT baseline. We use a standard tiny configuration with 12 transformer layers, 3 attention heads per layer, and a model dimension of 192 and simply replace softmax attention with RPC-Attention. ... There are 3 hyperparameters: 1) µ: this parameter controls the singular value thresholding operator in the PAP algorithm. We set µ to the recommended value given in Definition 1; 2) λ: this is a regularization parameter that controls the sparsity of the corruption matrix S. We finetune λ for training and observe that RPC-SymViT with λ = 3 yields the best performance for models with 2 iterations per layer and λ = 4 yields the best performance for models with iterations only in the first layer; 3) n: the number of iterations of the PAP algorithm in a RPC-Attention layer. ... The model has a dimension of 128 for the keys, queries and values, while the training and evaluation context length is set at 256. There are 16 layers altogether and 8 heads per layer. ... Both our RPC-FAN-tiny and the baseline FAN-tiny are trained for 300 epochs on the ImageNet-1K object classification task. (An illustrative configuration sketch is given below the table.) |
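The two code sketches below are illustrative additions, not excerpts from the paper or its repository. The first relates to the Pseudocode row: a minimal NumPy sketch of the generic ADMM recipe for Principal Component Pursuit on which the paper's Algorithm 2 is based. The default µ here uses the standard choice of Candès et al. (2011) as a stand-in for the paper's Definition 1, the default λ is the classical 1/√(max(n₁, n₂)) rather than the tuned values λ = 3 or 4 quoted above, and the attention-specific Principal Attention Pursuit wrapper (Algorithm 1) is omitted.

```python
# Minimal sketch of Principal Component Pursuit via ADMM (generic robust PCA).
# Not the authors' implementation: the default mu below follows Candes et al.
# (2011) as a stand-in for the paper's Definition 1, and the attention-specific
# PAP wrapper (Algorithm 1) is not reproduced here.
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding: shrink the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pcp_admm(M, lam=None, mu=None, n_iters=50, tol=1e-7):
    """Split M into a low-rank part L and a sparse corruption part S (M ~ L + S)."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))        # classical RPCA default
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())  # assumption: Candes et al. choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                        # scaled dual variable
    for _ in range(n_iters):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = soft_threshold(M - L + Y / mu, lam / mu)  # sparse update
        residual = M - L - S
        Y = Y + mu * residual                         # dual ascent step
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```

The paper uses a decomposition of this kind inside attention through its PAP algorithm, so that the low-rank component rather than the raw (possibly corrupted) input drives the computation; the sketch above only covers the generic matrix-decomposition step.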
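For the Experiment Setup row, the following dataclass simply collects the quoted RPC-SymViT-tiny settings in one place. The class and field names are hypothetical and do not correspond to the released repository's API.

```python
# Hypothetical configuration sketch: names are illustrative only and do not
# mirror the https://github.com/rachtsy/KPCA_code repository's actual API.
from dataclasses import dataclass

@dataclass
class RPCSymViTTinyConfig:
    # Standard tiny ViT backbone reported in the quoted setup
    depth: int = 12                 # transformer layers
    num_heads: int = 3              # attention heads per layer
    embed_dim: int = 192            # model dimension
    # RPC-Attention (PAP) hyperparameters from the quoted setup
    pap_iters: int = 1              # n: PAP iterations per RPC-Attention layer
    first_layer_only: bool = True   # extra iterations only in the first layer
    lam: float = 4.0                # lambda: 4 for first-layer-only, 3 for 2 iters/layer
    # mu is set at runtime to the recommended value from the paper's Definition 1
```

For instance, the 2-iterations-per-layer variant quoted above would correspond to `RPCSymViTTinyConfig(pap_iters=2, first_layer_only=False, lam=3.0)` under these assumed field names.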