Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Authors: Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime. We conduct experiments on a wide range of tasks spanning language pre-training, language modeling, machine translation and image classification.
Researcher Affiliation | Collaboration | Shengjie Luo (1), Shanda Li (2), Tianle Cai (4), Di He (6), Dinglan Peng (5), Shuxin Zheng (6), Guolin Ke (5), Liwei Wang (1,2,3), Tie-Yan Liu (6). (1) Center for Data Science, Peking University; (2) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; (3) Institute for Artificial Intelligence, Peking University; (4) Princeton University; (5) University of Science and Technology of China; (6) Microsoft Research
Pseudocode | Yes | A pseudocode implementation is provided in Algorithm 1, "Efficient Normalized Kernelized Attention with RPE using FFT". (A hedged PyTorch sketch of this FFT-based computation is given after the table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that code for the described methodology has been open-sourced.
Open Datasets | Yes | We use the GLUE (General Language Understanding Evaluation) dataset [47] as the downstream tasks to evaluate the performance of the pre-trained models. We conduct experiments on the WikiText-103 language modeling task... In machine translation, we evaluate our method on the widely used public datasets IWSLT14 German-English and French-English. We benchmark our method on ImageNet-1K [8], which contains 1.28M training images and 50K validation images from 1,000 classes.
Dataset Splits | Yes | We benchmark our method on ImageNet-1K [8], which contains 1.28M training images and 50K validation images from 1,000 classes. We conduct experiments to study whether using normalized attention can practically mitigate the performance drop issue observed in the previous works... Figure 2 summarizes the experiment results in terms of the validation BLEU scores with confidence intervals.
Hardware Specification | Yes | All models are run on 64 NVIDIA Tesla V100 GPUs with mixed-precision [26]. All models are trained on 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | All codes are implemented based on fairseq [28] in PyTorch [31]. However, specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | We use the BERT-base architecture [9] in our experiments, which consists of 12 Transformer layers. For each layer, the hidden size is set to 768, and the number of attention heads is set to 12. Following [32], the sequence length is set to 512 during both training and evaluation. The model architecture consists of 6 decoder layers. The number of attention heads is set to 8. The hidden dimension is set to 512. The dimension of the feed-forward layer is set to 2048. The dropout ratio and the weight decay are set to 0.1 and 0.01, respectively. The batch size is set to 64. The feature map dimension is set to 64. We use Adam [19] as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.98). The peak learning rate is set to 2e-3. The model is trained for 150k steps with a 6k-step warm-up stage followed by an inverse square-root learning rate scheduler. (A minimal PyTorch sketch of this optimizer and schedule is given after the table.)
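
The computation named in the Pseudocode row (Algorithm 1) can be illustrated with a short PyTorch sketch. This is not the authors' released code (none is linked above); it is a minimal reconstruction under the assumption that the relative positional bias b_{i-j} enters the kernelized attention scores multiplicatively, i.e. exp(b_{i-j}) φ(q_i)·φ(k_j), so that both the numerator and the normalizer of each output are Toeplitz matrix products over positions, which an FFT evaluates in O(n log n). The names toeplitz_matmul_fft, kernelized_attention_rpe, and the feature_map argument are illustrative, not taken from the paper.

```python
import torch


def toeplitz_matmul_fft(c_pos, c_neg, x):
    """Multiply the Toeplitz matrix T[i, j] = c[i - j] with x along the
    sequence dimension in O(n log n) via a circulant embedding and FFT.

    c_pos: (n,)   coefficients c[0], c[1], ..., c[n-1]      (offsets i - j >= 0)
    c_neg: (n-1,) coefficients c[-1], c[-2], ..., c[-(n-1)] (offsets i - j < 0)
    x:     (..., n, d)
    """
    n = x.shape[-2]
    m = 2 * n  # circulant size; one zero entry pads the wrap-around
    # First column of the embedding circulant: [c_0..c_{n-1}, 0, c_{-(n-1)}..c_{-1}]
    col = torch.cat([c_pos, c_pos.new_zeros(1), c_neg.flip(0)])
    x_pad = torch.nn.functional.pad(x, (0, 0, 0, n))            # (..., 2n, d)
    col_f = torch.fft.rfft(col, n=m).unsqueeze(-1)              # (m//2+1, 1)
    x_f = torch.fft.rfft(x_pad, n=m, dim=-2)                    # (..., m//2+1, d)
    y = torch.fft.irfft(col_f * x_f, n=m, dim=-2)               # (..., 2n, d)
    return y[..., :n, :]


def kernelized_attention_rpe(q, k, v, bias_pos, bias_neg, feature_map):
    """Kernelized attention with a multiplicative relative positional term:
    out_i = sum_j exp(b_{i-j}) phi(q_i)^T phi(k_j) v_j, divided by the same
    sum without v_j. Both sums are Toeplitz products over j, hence the FFT.

    q, k, v: (batch, n, d); bias_pos (n,), bias_neg (n-1,) hold b_{i-j};
    feature_map should be non-negative (e.g. elu(x)+1) so the denominator
    stays positive.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)               # (batch, n, D)
    c_pos, c_neg = bias_pos.exp(), bias_neg.exp()
    b, n, D = phi_k.shape
    d = v.shape[-1]
    # Numerator: phi(q_i)^T [ sum_j exp(b_{i-j}) phi(k_j) v_j^T ]
    kv = torch.einsum('bnD,bnd->bnDd', phi_k, v).reshape(b, n, D * d)
    kv = toeplitz_matmul_fft(c_pos, c_neg, kv).reshape(b, n, D, d)
    num = torch.einsum('bnD,bnDd->bnd', phi_q, kv)              # (batch, n, d)
    # Denominator: phi(q_i)^T [ sum_j exp(b_{i-j}) phi(k_j) ]
    den = torch.einsum('bnD,bnD->bn', phi_q,
                       toeplitz_matmul_fft(c_pos, c_neg, phi_k))
    return num / den.unsqueeze(-1).clamp_min(1e-6)
```

For example, feature_map=lambda t: torch.nn.functional.elu(t) + 1 gives a simple positive feature map in the spirit of linear attention. Relative to materializing the full n × n attention matrix, the cost per head is O(n D d log n) time and O(n D d) memory, which is consistent with the long-sequence speed-ups reported above.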
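
The optimizer and learning-rate schedule quoted in the Experiment Setup row can also be written out as a short sketch. The paper's runs use fairseq, which provides this schedule natively (inverse square-root with warm-up); the stand-alone PyTorch version below is only an assumed equivalent, with model as a placeholder module rather than the paper's Transformer.

```python
import torch

# Placeholder module; the paper's Transformer models (built in fairseq) are not reproduced here.
model = torch.nn.Linear(512, 512)

# Adam with eps=1e-6, betas=(0.9, 0.98), weight decay 0.01, and peak lr 2e-3,
# as quoted in the Experiment Setup row above.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3,
                             betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)

WARMUP_STEPS = 6_000  # 6k-step warm-up before the inverse square-root decay

def inverse_sqrt_factor(step):
    # Linear warm-up to the peak learning rate, then lr proportional to 1/sqrt(step).
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return (WARMUP_STEPS / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt_factor)

# Training-loop stub: one scheduler step per update, 150k updates in total.
for step in range(150_000):
    ...  # forward/backward on a batch would go here
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```

The LambdaLR factor multiplies the peak learning rate of 2e-3, so the effective rate ramps up linearly over the first 6k steps and then decays as the inverse square root of the step count.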