Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
Authors: Rachel S.Y. Teo, Tan Nguyen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation task. The code is publicly available at https://github.com/rachtsy/KPCA_code. 1 Introduction ... 4 Experimental Results We aim to numerically show that: (i) RPC-Attention achieves competitive or even better accuracy than the baseline softmax attention on clean data, and (ii) the advantages of RPC-Attention are more prominent when there is a contamination of samples across different types of data and a variety of tasks. |
| Researcher Affiliation | Academia | Rachel S.Y. Teo Department of Mathematics National University of Singapore rachel.tsy@u.nus.edu Tan M. Nguyen Department of Mathematics National University of Singapore tanmn@nus.edu.sg |
| Pseudocode | Yes | Algorithm 1 Principal Attention Pursuit (PAP) ... Algorithm 2 ADMM for Principal Components Pursuit (a generic sketch of PCP via ADMM is given below the table) |
| Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/KPCA_code. |
| Open Datasets | Yes | We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... We assess our model on the large-scale WikiText-103 language modeling task [47]. ... providing results on the ADE20K image segmentation task [95]. |
| Dataset Splits | Yes | We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. ... The training set contains about 28,000 articles, with a total of 103 million words. ... The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words. |
| Hardware Specification | Yes | All of our results are averaged over 5 runs with different seeds and run on 4 A100 GPU. |
| Software Dependencies | No | The paper references specific code implementations from other researchers (e.g., 'https://github.com/facebookresearch/deit', 'https://github.com/rstrudel/segmenter', 'https://github.com/IDSIA/lmtool-fwp') but does not list specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Our RPC-SymViT models have 5.2M parameters, the same as the SymViT baseline. We use a standard tiny configuration with 12 transformer layers, 3 attention heads per layer, and a model dimension of 192 and simply replace softmax attention with RPC-Attention. ... There are 3 hyperparameters: 1) µ: this parameter controls the singular value thresholding operator in the PAP algorithm. We set µ to the recommended value given in Definition 1; 2) λ: this is a regularization parameter that controls the sparsity of the corruption matrix S. We finetune λ for training and observe that RPC-SymViT with λ = 3 yields the best performance for models with 2 iterations per layer and λ = 4 yields the best performance for models with iterations only in the first layer; 3) n: the number of iterations of the PAP algorithm in a RPC-Attention layer. ... The model has a dimension of 128 for the keys, queries and values, while the training and evaluation context length is set at 256. There are 16 layers altogether and 8 heads per layer. ... Both our RPC-FAN-tiny and the baseline FAN-tiny are trained for 300 epochs on the ImageNet-1K object classification task. (An illustrative configuration sketch is given below the table.) |
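The two code sketches below are illustrative additions, not excerpts from the paper or its repository. The first relates to the Pseudocode row: a minimal NumPy sketch of the generic ADMM recipe for Principal Component Pursuit on which the paper's Algorithm 2 is based. The default µ here uses the standard choice of Candès et al. (2011) as a stand-in for the paper's Definition 1, the default λ is the classical 1/√(max(n₁, n₂)) rather than the tuned values λ = 3 or 4 quoted above, and the attention-specific Principal Attention Pursuit wrapper (Algorithm 1) is omitted.

```python
# Minimal sketch of Principal Component Pursuit via ADMM (generic robust PCA).
# Not the authors' implementation: the default mu below follows Candes et al.
# (2011) as a stand-in for the paper's Definition 1, and the attention-specific
# PAP wrapper (Algorithm 1) is not reproduced here.
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding: shrink the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pcp_admm(M, lam=None, mu=None, n_iters=50, tol=1e-7):
    """Split M into a low-rank part L and a sparse corruption part S (M ~ L + S)."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))        # classical RPCA default
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())  # assumption: Candes et al. choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                        # scaled dual variable
    for _ in range(n_iters):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = soft_threshold(M - L + Y / mu, lam / mu)  # sparse update
        residual = M - L - S
        Y = Y + mu * residual                         # dual ascent step
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```

The paper uses a decomposition of this kind inside attention through its PAP algorithm, so that the low-rank component rather than the raw (possibly corrupted) input drives the computation; the sketch above only covers the generic matrix-decomposition step.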
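For the Experiment Setup row, the following dataclass simply collects the quoted RPC-SymViT-tiny settings in one place. The class and field names are hypothetical and do not correspond to the released repository's API.

```python
# Hypothetical configuration sketch: names are illustrative only and do not
# mirror the https://github.com/rachtsy/KPCA_code repository's actual API.
from dataclasses import dataclass

@dataclass
class RPCSymViTTinyConfig:
    # Standard tiny ViT backbone reported in the quoted setup
    depth: int = 12                 # transformer layers
    num_heads: int = 3              # attention heads per layer
    embed_dim: int = 192            # model dimension
    # RPC-Attention (PAP) hyperparameters from the quoted setup
    pap_iters: int = 1              # n: PAP iterations per RPC-Attention layer
    first_layer_only: bool = True   # extra iterations only in the first layer
    lam: float = 4.0                # lambda: 4 for first-layer-only, 3 for 2 iters/layer
    # mu is set at runtime to the recommended value from the paper's Definition 1
```

For instance, the 2-iterations-per-layer variant quoted above would correspond to `RPCSymViTTinyConfig(pap_iters=2, first_layer_only=False, lam=3.0)` under these assumed field names.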