Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation

Authors: Yingyi Chen, Qinghua Tao, Francesco Tonin, Johan Suykens

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modelling and optimization. In numerical experiments, Primal-Attention achieves state-of-the-art performance on various datasets together with efficiency advantages over the canonical self-attention. (An illustrative sketch of this primal representation is given after the table.)
Researcher Affiliation | Academia | Yingyi Chen, ESAT-STADIUS, KU Leuven, Belgium (yingyi.chen@esat.kuleuven.be); Qinghua Tao, ESAT-STADIUS, KU Leuven, Belgium (qinghua.tao@esat.kuleuven.be); Francesco Tonin, ESAT-STADIUS, KU Leuven, Belgium (francesco.tonin@esat.kuleuven.be); Johan A.K. Suykens, ESAT-STADIUS, KU Leuven, Belgium (johan.suykens@esat.kuleuven.be)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is available at https://github.com/yingyichen-cyy/PrimalAttention
Open Datasets | Yes | UEA Time Series Classification Archive [31] is the benchmark for the evaluation on temporal sequences. Long-Range Arena (LRA) [39] is a benchmark for the long-sequence scenarios. We consider the offline reinforcement learning (RL) performance of our methods on the D4RL benchmark [47] designed for continuous control tasks. We evaluate the capability of our Primal.+ model with DeiT-Small/16 [7] as backbone on ImageNet-100 [48] and ImageNet-1K [23] for the image classification task. We also experiment with the language modelling task on WikiText-103 [49].
Dataset Splits | No | The paper mentions using several benchmarks and following existing settings or training protocols, but it does not explicitly state the train/validation/test split percentages or counts.
Hardware Specification | Yes | Experiments are run on one NVIDIA Tesla P100 SXM2 16GB GPU. Experiments are conducted on a single NVIDIA Tesla V100 SXM2 32GB GPU. Each experiment is run with 3 different seeds on one NVIDIA Tesla P100 SXM2 16GB GPU. On ImageNet, we train DeiT-Small/16 and our Primal.+DeiT-Small/16 from scratch following the same training protocols in [7] with 4 NVIDIA Tesla V100 SXM2 32GB GPUs. Models are trained from scratch on 4 NVIDIA Tesla V100 SXM2 32GB GPUs for 150K updates after a 6K-step warm-up.
Software Dependencies | No | The paper mentions using "PyTorch" for the Long-Range Arena benchmark but does not specify a version number or list any other software dependencies with their versions.
Experiment Setup | Yes | The two main hyper-parameters of our method are the coefficient η in (10) and the number of projection directions s of KSVD in (6). The hyper-parameter search of our method is with η ∈ {0.1, 0.2, 0.5}, s ∈ {20, 30, 40}. We employ a 2-layer Transformer as backbone with hidden dimension 512 on 8 heads and embedding dimension 64 for self-attention. Our hyper-parameters are set from η ∈ {0.05, 0.1}, s ∈ {20, 30}. The Transformer backbone is set with 2 layers, hidden dimension 128 with 2 heads and embedding dimension 64 with mean pooling. We adopt the architecture of 3 layers, hidden dimension 256 with 4 heads, and embedding dimension 64. Our hyper-parameters are set as η = 0.05, s ∈ {32, 64, 96}. Our hyper-parameters are chosen as η = 0.05, s ∈ {32, 64, 96}. On WikiText-103, we follow the setting in [50] where the sequence length is set to 512 and the model consists of 6 decoder layers with 8 heads. (An illustrative sweep over these grids is sketched below.)
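To make the "Research Type" row concrete: the paper represents self-attention in the primal form through two projection scores, e(x) = W_e^T φ_q(x) on the query-side feature map and r(x) = W_r^T φ_k(x) on the key-side feature map, so no seq_len × seq_len attention matrix is materialized. The snippet below is a minimal sketch of that idea only; PyTorch is assumed, the class and parameter names (PrimalAttentionSketch, W_e, W_r) and the normalized linear feature maps are illustrative choices, and the KSVD regularizer is omitted. It is not the authors' implementation (see their repository above for that).

```python
# Minimal sketch of a primal-style attention block: two primal projection
# scores e and r are computed per token and concatenated, instead of forming
# a full attention matrix. Names and feature maps here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrimalAttentionSketch(nn.Module):
    def __init__(self, dim: int, s: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # query-side feature map (assumed linear)
        self.k_proj = nn.Linear(dim, dim)   # key-side feature map (assumed linear)
        # s projection directions for the left/right primal scores
        self.W_e = nn.Parameter(torch.randn(dim, s) * dim ** -0.5)
        self.W_r = nn.Parameter(torch.randn(dim, s) * dim ** -0.5)
        self.out = nn.Linear(2 * s, dim)    # map concatenated scores back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); no seq_len x seq_len attention matrix is formed.
        phi_q = F.normalize(self.q_proj(x), dim=-1)   # assumed normalization
        phi_k = F.normalize(self.k_proj(x), dim=-1)
        e = phi_q @ self.W_e                          # (batch, seq_len, s)
        r = phi_k @ self.W_r                          # (batch, seq_len, s)
        return self.out(torch.cat([e, r], dim=-1))


if __name__ == "__main__":
    layer = PrimalAttentionSketch(dim=64, s=20)
    y = layer(torch.randn(2, 128, 64))
    print(y.shape)  # torch.Size([2, 128, 64])
```

Because e and r are computed per token and concatenated, the cost grows linearly with sequence length, which is the efficiency advantage the quoted text refers to.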
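For the "Experiment Setup" row, the reported grids over η and s can be swept as below. The sketch assumes the total training objective is the task loss plus η times the KSVD regularization term of Eq. (10); total_loss and the dummy tensors are placeholders for a real forward pass, not the authors' API.

```python
# Illustrative sweep over the reported hyper-parameter grid.
from itertools import product

import torch

eta_grid = [0.1, 0.2, 0.5]   # reported search range for the KSVD loss weight
s_grid = [20, 30, 40]        # reported search range for #projection directions


def total_loss(task_loss: torch.Tensor, ksvd_loss: torch.Tensor, eta: float) -> torch.Tensor:
    # Assumed Eq. (10)-style weighting: task objective plus eta * KSVD regularizer.
    return task_loss + eta * ksvd_loss


for eta, s in product(eta_grid, s_grid):
    # Placeholder tensors stand in for a real forward pass of a model built
    # with s projection directions (e.g., PrimalAttentionSketch(dim, s=s) above).
    dummy_task, dummy_ksvd = torch.tensor(1.0), torch.tensor(0.3)
    print(f"eta={eta}, s={s}, total={total_loss(dummy_task, dummy_ksvd, eta).item():.3f}")
```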