Jump Self-attention: Capturing High-order Statistics in Transformers
Authors: Haoyi Zhou, Siyang Xiao, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments, we empirically show that our methods significantly increase the performance on ten different tasks. |
| Researcher Affiliation | Academia | Haoyi Zhou, BDBC, Beihang University, Beijing, China 100191, haoyi@buaa.edu.cn; Siyang Xiao, BDBC, Beihang University, Beijing, China 100191, xiaosy@act.buaa.edu.cn; Shanghang Zhang, School of Computer Science, Peking University, Beijing, China 100871, shanghang@pku.edu.cn; Jieqi Peng, BDBC, Beihang University, Beijing, China 100191, pengjq@act.buaa.edu.cn; Shuai Zhang, BDBC, Beihang University, Beijing, China 100191, zhangs@act.buaa.edu.cn; Jianxin Li, BDBC, Beihang University, Beijing, China 100191, lijx@buaa.edu.cn |
| Pseudocode | No | The paper describes the proposed method conceptually and mathematically but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zhouhaoyi/JAT2022. |
| Open Datasets | Yes | We conduct JAT's experiments of the language understanding and generalization capabilities on the General Language Understanding Evaluation (GLUE) benchmark [21], a collection of diverse natural language understanding tasks. We also perform additional experiments on the SuperGLUE benchmark [20]. Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for the nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, ..., 10.0}. The other settings follow the recommendations of the original paper. Since the proposed JAT can be used interchangeably with canonical self-attention, we perform a grid search on the layer replacement. There are two sets of layer deployment: the first combination is chosen from {Layer 1-4, Layer 5-8, Layer 9-12} and the alternative is {Layer 1-6, Layer 7-12}. Another important selection is the multi-head grouping: we employ the side-by-side strategy, replacing the heads {2, 4, 6, 8, 10} with JAT. We do not use any ensembling strategy or multi-tasking scheme in this fine-tuning. The evaluation is performed on the Dev set. Metric: We use three different evaluation metrics on the nine tasks. A hedged sketch of loading these Dev splits is given below the table. |
| Dataset Splits | Yes | The evaluation is performed on the Dev set. |
| Hardware Specification | Yes | Platform: Intel Xeon 3.2 GHz CPU + 4 × Nvidia V100 GPUs (32 GB). |
| Software Dependencies | No | The paper mentions software such as BERT, RoBERTa, and MindSpore but does not provide version numbers for these or other ancillary software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for the nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, ..., 10.0}. A hedged sketch of the resulting search grid follows this table. |
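
The Open Datasets and Dataset Splits rows state that evaluation is performed on the GLUE Dev sets. The paper does not name a data-loading toolkit, so the minimal sketch below assumes the Hugging Face `datasets` library purely for illustration; the task list and split names follow the public GLUE release, not the authors' released code.

```python
# Minimal sketch: fetching the nine GLUE tasks and their Dev (validation) splits.
# Assumption: Hugging Face `datasets` is used; the paper does not specify tooling.
from datasets import load_dataset

GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"]

def load_dev_split(task: str):
    """Return the Dev (validation) split on which the paper reports results."""
    dataset = load_dataset("glue", task)
    # MNLI ships matched and mismatched dev sets; the matched split is assumed here.
    dev_key = "validation_matched" if task == "mnli" else "validation"
    return dataset[dev_key]

if __name__ == "__main__":
    dev = load_dev_split("mrpc")
    print(f"MRPC Dev examples: {len(dev)}")
```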
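
The Experiment Setup row lists the fine-tuning search space: batch size 32, 5 epochs, threshold ρ from {4.0, 4.5, ..., 10.0}, two layer-replacement schemes, and JAT placed on heads {2, 4, 6, 8, 10}. The sketch below only enumerates candidate configurations under those stated values; treating each scheme as a choice of one contiguous layer block, and the configuration keys themselves, are interpretive assumptions rather than the authors' training script.

```python
# Hypothetical enumeration of the fine-tuning grid described in the paper.
# The dict keys and the "one block per run" interpretation are assumptions.
from itertools import product

RHO_GRID = [4.0 + 0.5 * i for i in range(13)]        # 4.0, 4.5, ..., 10.0
LAYER_SCHEMES = {
    "thirds": [list(range(1, 5)), list(range(5, 9)), list(range(9, 13))],  # Layer 1-4 / 5-8 / 9-12
    "halves": [list(range(1, 7)), list(range(7, 13))],                     # Layer 1-6 / 7-12
}
JAT_HEADS = [2, 4, 6, 8, 10]                          # side-by-side head grouping
BATCH_SIZE, EPOCHS = 32, 5

def grid_configs():
    """Yield one configuration per (rho, layer block) combination."""
    for rho, (scheme, blocks) in product(RHO_GRID, LAYER_SCHEMES.items()):
        for block in blocks:
            yield {
                "rho": rho,
                "scheme": scheme,
                "jat_layers": block,
                "jat_heads": JAT_HEADS,
                "batch_size": BATCH_SIZE,
                "epochs": EPOCHS,
            }

if __name__ == "__main__":
    configs = list(grid_configs())
    print(f"{len(configs)} candidate configurations")  # 13 rho values x 5 layer blocks = 65
    print(configs[0])
```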