cosFormer: Rethinking Softmax In Attention

Authors: Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark.
Researcher Affiliation | Collaboration | SenseTime Research, Shanghai AI Laboratory, Australian National University, Northwestern Polytechnical University, The University of Hong Kong
Pseudocode | Yes | A.2 PSEUDO CODE OF COSFORMER: Algorithm 1 describes how to compute COSFORMER attention. (A hedged sketch of this computation is given after the table.)
Open Source Code | Yes | The source code is available at COSFORMER.
Open Datasets | Yes | We train our model... on the WikiText-103 (Merity et al., 2017)... We perform extensive experiments on both autoregressive language models and bidirectional models on five public benchmarks, including WikiText-103 (Merity et al., 2017), GLUE (Wang et al., 2018), IMDB (Maas et al., 2011), AMAZON (Ni et al., 2019) and the Long-Range Arena benchmark (Tay et al., 2020b).
Dataset Splits | Yes | We train our model... on the WikiText-103 (Merity et al., 2017) and report perplexity on the validation and test splits in Table 2. (Table 7 also explicitly provides Train/Valid/Test counts for all datasets.)
Hardware Specification | Yes | We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs... We conduct experiments on one Nvidia A6000 GPU...
Software Dependencies | No | We first implement our method on Jax (Bradbury et al., 2018). The paper mentions Jax but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Specifically, we adopt their large model, which has 16 cascaded layers with a projected dimension of 1024, and replace the self-attention module with our proposed linear attention module. We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates... We train this bidirectional task on 2 Nvidia Tesla A100 GPUs for 50K iterations with an input sequence length of 512.
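Since the pseudocode row points to the appendix algorithm and the dependency row notes a Jax implementation, the following is a minimal, non-causal sketch of cosFormer-style linear attention in JAX. The function name, the shapes, the ReLU feature map, and the use of the sequence length as the cos re-weighting scale M are assumptions made for illustration; Algorithm 1 in the paper's appendix and the released source code remain the authoritative reference (the autoregressive variant additionally requires causal cumulative sums, which are omitted here).

```python
# Minimal non-causal sketch of cosFormer-style linear attention (assumed details).
import jax.numpy as jnp
from jax import nn, random


def cosformer_attention(q, k, v, eps=1e-6):
    """q, k, v: (seq_len, dim) arrays. Returns (seq_len, dim)."""
    n = q.shape[0]  # sequence length, assumed here as the re-weighting scale M

    # Non-negative feature map, as in the paper's linear attention formulation.
    q, k = nn.relu(q), nn.relu(k)

    # cos-based re-weighting, decomposed as
    # cos(pi/2 * (i - j)/M) = cos(pi*i/2M)cos(pi*j/2M) + sin(pi*i/2M)sin(pi*j/2M)
    idx = jnp.arange(n)
    cos_w = jnp.cos(jnp.pi / 2 * idx / n)[:, None]  # (n, 1)
    sin_w = jnp.sin(jnp.pi / 2 * idx / n)[:, None]  # (n, 1)

    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Linear-complexity ordering: contract K with V first (d x d),
    # never materialising the n x n attention matrix.
    kv_cos = k_cos.T @ v                              # (d, d)
    kv_sin = k_sin.T @ v                              # (d, d)
    num = q_cos @ kv_cos + q_sin @ kv_sin             # (n, d)

    # Row-wise normalisation term.
    z = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)  # (n,)
    return num / (z[:, None] + eps)


# Toy usage on random inputs.
key = random.PRNGKey(0)
q, k, v = (random.normal(subkey, (128, 64)) for subkey in random.split(key, 3))
out = cosformer_attention(q, k, v)
print(out.shape)  # (128, 64)
```

Because K is contracted with V before touching Q, memory and time scale linearly with sequence length, which is the property the Long-Range Arena experiments in the table rely on.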