Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Authors: Rachel S.Y. Teo, Tan Nguyen

NeurIPS 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response.
Research Type: Experimental. LLM response: We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation task. The code is publicly available at https://github.com/rachtsy/KPCA_code. ... (Section 4, Experimental Results) We aim to numerically show that: (i) RPC-Attention achieves competitive or even better accuracy than the baseline softmax attention on clean data, and (ii) the advantages of RPC-Attention are more prominent when there is a contamination of samples across different types of data and a variety of tasks.
Researcher Affiliation: Academia. LLM response: Rachel S.Y. Teo, Department of Mathematics, National University of Singapore (rachel.tsy@u.nus.edu); Tan M. Nguyen, Department of Mathematics, National University of Singapore (tanmn@nus.edu.sg).
Pseudocode: Yes. LLM response: Algorithm 1, Principal Attention Pursuit (PAP) ... Algorithm 2, ADMM for Principal Components Pursuit. (A sketch of the underlying Principal Component Pursuit iteration appears after this table.)
Open Source Code: Yes. LLM response: The code is publicly available at https://github.com/rachtsy/KPCA_code.
Open Datasets: Yes. LLM response: We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... We assess our model on the large-scale WikiText-103 language modeling task [47]. ... providing results on the ADE20K image segmentation task [95].
Dataset Splits: Yes. LLM response: We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... The training set contains about 28,000 articles, with a total of 103 million words. ... The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words.
Hardware Specification: Yes. LLM response: All of our results are averaged over 5 runs with different seeds and run on 4 A100 GPUs.
Software Dependencies: No. LLM response: The paper references specific code implementations from other researchers (e.g., https://github.com/facebookresearch/deit, https://github.com/rstrudel/segmenter, https://github.com/IDSIA/lmtool-fwp) but does not list specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup: Yes. LLM response: Our RPC-SymViT models have 5.2M parameters, the same as the SymViT baseline. We use a standard tiny configuration with 12 transformer layers, 3 attention heads per layer, and a model dimension of 192, and simply replace softmax attention with RPC-Attention. ... There are 3 hyperparameters: 1) µ: this parameter controls the singular value thresholding operator in the PAP algorithm. We set µ to the recommended value given in Definition 1; 2) λ: this is a regularization parameter that controls the sparsity of the corruption matrix S. We finetune λ for training and observe that RPC-SymViT with λ = 3 yields the best performance for models with 2 iterations per layer and λ = 4 yields the best performance for models with iterations only in the first layer; 3) n: the number of iterations of the PAP algorithm in an RPC-Attention layer. ... The model has a dimension of 128 for the keys, queries, and values, while the training and evaluation context length is set at 256. There are 16 layers altogether and 8 heads per layer. ... Both our RPC-FAN-tiny and the baseline FAN-tiny are trained for 300 epochs on the ImageNet-1K object classification task. (A configuration sketch collecting these values follows the code example below.)
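
To make the pseudocode entry above more concrete, the following is a minimal NumPy sketch of the standard Principal Component Pursuit iteration from the robust PCA literature (the problem that Algorithm 2, ADMM for Principal Components Pursuit, solves). It is not the authors' released implementation: the function names, the default choices of µ and λ, and the stopping rule are assumptions based on the common formulation min ||L||_* + λ||S||_1 subject to L + S = M.

```python
import numpy as np


def soft_threshold(X, tau):
    """Entrywise shrinkage; this is what controls the sparsity of the corruption matrix S."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Singular value thresholding operator: shrink the singular values to obtain the low-rank part L."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt


def pcp_admm(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Decompose M into L (low rank) + S (sparse) with an ADMM/ALM-style iteration.

    The defaults lam = 1/sqrt(max(n1, n2)) and mu = n1*n2 / (4*||M||_1) follow the usual
    robust-PCA recommendations; they are assumptions, not necessarily the values the paper
    prescribes in its Definition 1.
    """
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M, dtype=float)
    S = np.zeros_like(M, dtype=float)
    Y = np.zeros_like(M, dtype=float)  # dual variable for the constraint M = L + S
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)       # dual ascent on the constraint residual
        if np.linalg.norm(M - L - S) <= tol * max(np.linalg.norm(M), 1.0):
            break
    return L, S
```

As a usage note, `L, S = pcp_admm(A)` returns a low-rank component and a sparse residual of a matrix `A`. Per the experiment-setup excerpt, RPC-Attention runs only a small number n of such iterations inside an attention layer (e.g., 2 per layer, or iterations only in the first layer) rather than solving the problem to convergence.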
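
To keep the experiment-setup details in one place, here is a hypothetical configuration sketch that collects the values reported above. The dictionary and key names are illustrative only and do not correspond to argument names in the released repository.

```python
# Hypothetical configuration sketch; key names are illustrative, values are taken from the excerpt above.
RPC_SYMVIT_TINY = {
    "params": "5.2M",          # same parameter count as the SymViT baseline
    "depth": 12,               # transformer layers
    "num_heads": 3,            # attention heads per layer
    "embed_dim": 192,          # model dimension
    "pap_iters": 2,            # n: PAP iterations per RPC-Attention layer (alternative: iterate only in layer 1)
    "pap_lambda": 3.0,         # λ: best with 2 iterations per layer; λ = 4 is best when iterating only in layer 1
    "pap_mu": "Definition 1",  # µ: set to the recommended value for the singular value thresholding operator
}

RPC_LM_WIKITEXT103 = {
    "qkv_dim": 128,            # dimension of the keys, queries, and values
    "context_length": 256,     # training and evaluation context length
    "depth": 16,               # layers
    "num_heads": 8,            # heads per layer
}

RPC_FAN_TINY = {
    "epochs": 300,             # trained on ImageNet-1K, same schedule as the FAN-tiny baseline
}
```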