cosFormer: Rethinking Softmax In Attention
Authors: Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. |
| Researcher Affiliation | Collaboration | 1 SenseTime Research, 2 Shanghai AI Laboratory, 3 Australian National University, 4 Northwestern Polytechnical University, 5 The University of Hong Kong |
| Pseudocode | Yes | A.2 PSEUDO CODE OF COSFORMER: Algorithm 1 describes how to compute COSFORMER attention (a hedged sketch of this computation follows the table). |
| Open Source Code | Yes | The source code is available at COSFORMER. |
| Open Datasets | Yes | We train our model... on the WikiText-103 (Merity et al., 2017)... We perform extensive experiments on both autoregressive language models and bidirectional models on five public benchmarks, including WikiText-103 (Merity et al., 2017), GLUE (Wang et al., 2018), IMDB (Maas et al., 2011), AMAZON (Ni et al., 2019) and the Long-Range Arena benchmark (Tay et al., 2020b). |
| Dataset Splits | Yes | We train our model... on the WikiText-103 (Merity et al., 2017) and report perplexity on the validation and test splits in Table 2. (Table 7 also explicitly provides Train/Valid/Test counts for all datasets.) |
| Hardware Specification | Yes | We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs... We conduct experiments on one Nvidia A6000 GPU... |
| Software Dependencies | No | We first implement our method on Jax (Bradbury et al., 2018). The paper mentions Jax but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | Specifically, we adopt their large model, which has 16 cascaded layers with a projected dimension of 1024, and replace the self-attention module with our proposed linear attention module. We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs for 50K iterations with an input sequence length of 512. |
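
For reference, the computation summarized by Algorithm 1 (linear attention with a ReLU feature map and a cos-based re-weighting, decomposed via Ptolemy's identity) can be sketched in JAX as below. This is a minimal, non-causal, single-head sketch written for this summary, not the authors' released code; the function name, the `(seq_len, d_model)` shapes, and the `eps` stabilizer are assumptions made for illustration.

```python
import jax
import jax.numpy as jnp


def cosformer_attention(q, k, v, eps=1e-6):
    """Illustrative non-causal cosFormer-style linear attention.

    q, k, v: float arrays of shape (seq_len, d_model). Shapes, the function
    name, and eps are assumptions for this sketch, not the released code.
    """
    n, _ = q.shape
    m = n  # M in the paper: an upper bound on the sequence length

    # Non-negative feature map: ReLU replaces the softmax's exponential.
    q, k = jax.nn.relu(q), jax.nn.relu(k)

    # cos re-weighting decomposed with Ptolemy's identity:
    # cos(pi/2 * (i - j)/M) = cos(pi*i/2M)cos(pi*j/2M) + sin(pi*i/2M)sin(pi*j/2M)
    idx = jnp.arange(n)
    cos_w = jnp.cos(jnp.pi * idx / (2 * m))[:, None]  # (n, 1)
    sin_w = jnp.sin(jnp.pi * idx / (2 * m))[:, None]
    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Linear complexity: form K^T V (d x d) first, then multiply by Q.
    numerator = q_cos @ (k_cos.T @ v) + q_sin @ (k_sin.T @ v)      # (n, d)

    # Row-wise normalizer plays the role of the softmax denominator.
    denom = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)  # (n,)
    return numerator / (denom[:, None] + eps)
```

The point of the decomposition is that the position-dependent cos term splits into per-position cos/sin factors on Q and K, so the attention is computed as Q'(K'^T V) in O(N d^2) rather than materializing an N x N attention matrix, which is what makes the Long-Range Arena experiments tractable at long sequence lengths.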