Jump Self-attention: Capturing High-order Statistics in Transformers

Authors: Haoyi Zhou, Siyang Xiao, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With extensive experiments, we empirically show that our methods significantly increase the performance on ten different tasks.
Researcher Affiliation | Academia | Haoyi Zhou, BDBC, Beihang University, Beijing, China 100191, haoyi@buaa.edu.cn; Siyang Xiao, BDBC, Beihang University, Beijing, China 100191, xiaosy@act.buaa.edu.cn; Shanghang Zhang, School of Computer Science, Peking University, Beijing, China 100871, shanghang@pku.edu.cn; Jieqi Peng, BDBC, Beihang University, Beijing, China 100191, pengjq@act.buaa.edu.cn; Shuai Zhang, BDBC, Beihang University, Beijing, China 100191, zhangs@act.buaa.edu.cn; Jianxin Li, BDBC, Beihang University, Beijing, China 100191, lijx@buaa.edu.cn
Pseudocode | No | The paper describes the proposed method conceptually and mathematically but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zhouhaoyi/JAT2022.
Open Datasets | Yes | We conduct JAT's experiments on language understanding and generalization capabilities using the General Language Understanding Evaluation (GLUE) benchmark [21], a collection of diverse natural language understanding tasks. We also perform additional experiments on the SuperGLUE benchmark [20]. Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, ..., 10.0}. The other settings follow the recommendation of the original paper. Since the proposed JAT can be used interchangeably with canonical self-attention, we perform a grid search on the layer replacement. There are two sets of layer deployment: the first combination is chosen from {Layer1-4, Layer5-8, Layer9-12} and the alternative is {Layer1-6, Layer7-12}. Another important selection is the multi-head grouping: we employ the side-by-side strategy, replacing heads {2, 4, 6, 8, 10} with JAT. We do not use any ensembling strategy or multi-tasking scheme in this fine-tuning. The evaluation is performed on the Dev set. Metric: We use three different evaluation metrics on the 9 tasks. (Hedged Python sketches of this fine-tuning grid and of the side-by-side head replacement appear after the table.)
Dataset Splits | Yes | The evaluation is performed on the Dev set.
Hardware Specification | Yes | Platform: Intel Xeon 3.2 GHz CPU + Nvidia V100 GPU (32 GB) × 4.
Software Dependencies | No | The paper mentions software like BERT, RoBERTa, and MindSpore but does not provide specific version numbers for these or other ancillary software components (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, ..., 10.0}.
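
To make the reported fine-tuning settings concrete, the following minimal Python sketch enumerates the search space described in the Open Datasets and Experiment Setup rows (batch size 32, 5 epochs, ρ ∈ {4.0, 4.5, ..., 10.0}, two layer-deployment schemes, JAT on heads {2, 4, 6, 8, 10}). The callable name `finetune_glue` and the choice to try one contiguous layer block at a time are assumptions made for illustration; this is not the authors' released code.

```python
# Illustrative sketch of the fine-tuning grid; `finetune_glue` is a
# hypothetical callable that fine-tunes on a GLUE task and returns a Dev score.
import itertools

BATCH_SIZE = 32                                   # reported batch size
NUM_EPOCHS = 5                                    # reported fine-tuning epochs

# Threshold rho is selected from {4.0, 4.5, ..., 10.0}.
RHO_GRID = [4.0 + 0.5 * i for i in range(13)]

# Two layer-deployment schemes for a 12-layer encoder: which contiguous
# block of layers has its self-attention replaced with JAT (assumed reading).
LAYER_GROUPS = [
    [list(range(1, 5)), list(range(5, 9)), list(range(9, 13))],   # Layer1-4, 5-8, 9-12
    [list(range(1, 7)), list(range(7, 13))],                      # Layer1-6, 7-12
]

# Side-by-side multi-head grouping: heads 2, 4, 6, 8, 10 use JAT.
JAT_HEADS = [2, 4, 6, 8, 10]


def run_grid_search(finetune_glue):
    """Search over rho and layer deployment, keeping the best Dev score."""
    best = None
    for rho, scheme in itertools.product(RHO_GRID, LAYER_GROUPS):
        for jat_layers in scheme:                 # try one block at a time
            score = finetune_glue(
                batch_size=BATCH_SIZE,
                epochs=NUM_EPOCHS,
                rho=rho,
                jat_layers=jat_layers,
                jat_heads=JAT_HEADS,
            )
            if best is None or score > best[0]:
                best = (score, rho, jat_layers)
    return best
```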
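
The Open Datasets row also notes that JAT is used interchangeably with canonical self-attention, with selected heads replaced side by side. Below is a PyTorch-style sketch of one way such head-level mixing could be wired. The class name `MixedHeadSelfAttention` and the `jump_attention` placeholder (which simply falls back to dot-product attention here so the sketch runs end to end) are illustrative assumptions, not the JAT implementation from the linked repository.

```python
import torch
import torch.nn as nn


class MixedHeadSelfAttention(nn.Module):
    """Per-head routing: listed heads go through `jump_attention`, the rest
    through canonical dot-product attention (head indices assumed 1-based)."""

    def __init__(self, d_model=768, n_heads=12, jat_heads=(2, 4, 6, 8, 10)):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.jat_heads = set(jat_heads)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def dot_product_attention(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        return torch.softmax(scores, dim=-1) @ v

    def jump_attention(self, q, k, v):
        # Placeholder for JAT's high-order attention; falls back to the
        # canonical form so the sketch stays self-contained and runnable.
        return self.dot_product_attention(q, k, v)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                              # -> (batch, heads, seq, d_head)
            return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        heads = []
        for h in range(self.n_heads):
            attn = self.jump_attention if (h + 1) in self.jat_heads \
                else self.dot_product_attention
            heads.append(attn(q[:, h], k[:, h], v[:, h]))
        y = torch.stack(heads, dim=1).transpose(1, 2).reshape(b, s, -1)
        return self.out(y)


# Usage example: this module would replace the attention in a chosen layer
# block (e.g. Layer1-4), matching the layer-deployment grid sketched above.
if __name__ == "__main__":
    attn = MixedHeadSelfAttention()
    out = attn(torch.randn(2, 16, 768))
    print(out.shape)                               # torch.Size([2, 16, 768])
```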