cosFormer: Rethinking Softmax In Attention

Authors: Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark.
Researcher Affiliation | Collaboration | SenseTime Research, Shanghai AI Laboratory, Australian National University, Northwestern Polytechnical University, The University of Hong Kong
Pseudocode | Yes | A.2 PSEUDO CODE OF COSFORMER: Algorithm 1 describes how to compute COSFORMER attention. (A hedged sketch of this computation is given after the table.)
Open Source Code | Yes | The source code is available at COSFORMER.
Open Datasets | Yes | We train our model... on the WikiText-103 (Merity et al., 2017)... We perform extensive experiments on both autoregressive language models and bidirectional models on five public benchmarks, including WikiText-103 (Merity et al., 2017), GLUE (Wang et al., 2018), IMDB (Maas et al., 2011), AMAZON (Ni et al., 2019) and the Long-Range Arena benchmark (Tay et al., 2020b).
Dataset Splits | Yes | We train our model... on the WikiText-103 (Merity et al., 2017) and report perplexity on the validation and test splits in Table 2. (Table 7 also explicitly provides Train/Valid/Test counts for all datasets.)
Hardware Specification | Yes | We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs... We conduct experiments on one Nvidia A6000 GPU...
Software Dependencies | No | We first implement our method on Jax (Bradbury et al., 2018). The paper mentions Jax but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Specifically, we adopt their large model, which has 16 cascaded layers with a projected dimension of 1024, and replace the self-attention module with our proposed linear attention module. We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs for 50K iterations with an input sequence length of 512.
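Since the pseudocode row points to the appendix algorithm and the dependency row notes a Jax implementation, the following is a minimal, non-causal sketch of cosFormer-style linear attention in JAX. The function name, the shapes, the ReLU feature map, and the use of the sequence length as the cos re-weighting scale M are assumptions made for illustration; Algorithm 1 in the paper's appendix and the released source code remain the authoritative reference (the autoregressive variant additionally requires causal cumulative sums, which are omitted here).

```python
# Minimal non-causal sketch of cosFormer-style linear attention (assumed details).
import jax.numpy as jnp
from jax import nn, random


def cosformer_attention(q, k, v, eps=1e-6):
    """q, k, v: (seq_len, dim) arrays. Returns (seq_len, dim)."""
    n = q.shape[0]  # sequence length, assumed here as the re-weighting scale M

    # Non-negative feature map, as in the paper's linear attention formulation.
    q, k = nn.relu(q), nn.relu(k)

    # cos-based re-weighting, decomposed as
    # cos(pi/2 * (i - j)/M) = cos(pi*i/2M)cos(pi*j/2M) + sin(pi*i/2M)sin(pi*j/2M)
    idx = jnp.arange(n)
    cos_w = jnp.cos(jnp.pi / 2 * idx / n)[:, None]  # (n, 1)
    sin_w = jnp.sin(jnp.pi / 2 * idx / n)[:, None]  # (n, 1)

    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Linear-complexity ordering: contract K with V first (d x d),
    # never materialising the n x n attention matrix.
    kv_cos = k_cos.T @ v                              # (d, d)
    kv_sin = k_sin.T @ v                              # (d, d)
    num = q_cos @ kv_cos + q_sin @ kv_sin             # (n, d)

    # Row-wise normalisation term.
    z = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)  # (n,)
    return num / (z[:, None] + eps)


# Toy usage on random inputs.
key = random.PRNGKey(0)
q, k, v = (random.normal(subkey, (128, 64)) for subkey in random.split(key, 3))
out = cosformer_attention(q, k, v)
print(out.shape)  # (128, 64)
```

Because K is contracted with V before touching Q, memory and time scale linearly with sequence length, which is the property the Long-Range Arena experiments in the table rely on.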