Long-Short Transformer: Efficient Transformers for Language and Vision
Authors: Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of previous methods, while being faster and able to handle 3× longer sequences than its full-attention version on the same hardware. On ImageNet, it can obtain state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224×224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. (A minimal sketch of the long-short attention idea appears after the table.) |
| Researcher Affiliation | Collaboration | Chen Zhu¹, Wei Ping², Chaowei Xiao²,³, Mohammad Shoeybi², Tom Goldstein¹, Anima Anandkumar²,⁴, and Bryan Catanzaro². ¹University of Maryland, College Park; ²NVIDIA; ³Arizona State University; ⁴California Institute of Technology |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and models are released at https://github.com/NVIDIA/transformer-ls. |
| Open Datasets | Yes | We train and evaluate our model on enwik8 and text8, each with 100M characters, divided into 90M, 5M, and 5M for train, dev, and test, following [48]. Our smaller 12-layer and larger 30-layer models are Pre-LN Transformers with the same width and depth as Longformer [20], except that we add relative position encoding to the projected segments in each layer. We adopt the cache mechanism of Transformer-XL [9], setting the cache size to be the same as the input sequence length. We follow a similar training schedule to Longformer and train our model in 3 phases with increasing sequence lengths. The input sequence lengths are 2048, 4096, and 8192 for the 3 phases, respectively. By comparison, Longformer trains their model in 5 phases on GPUs with 48GB memory (the maximum of ours is 32GB), where the sequence length is 23,040 in the last phase. The window size of Longformer increases with depth and its average window size is 4352 in phase 5, while our effective number of attended tokens is 1280 on average in the last phase. Each experiment takes around 8 days to finish on 8 V100 GPUs. Detailed hyperparameters are shown in Appendix D. For testing, same as Longformer, we split the dataset into overlapping sequences of length 32K at a step size of 512, and evaluate the BPC for predicting the next 512 tokens given the previous 32K characters. (Sketches of this data pipeline and evaluation protocol appear after the table.) |
| Dataset Splits | Yes | We train and evaluate our model on enwik8 and text8, each with 100M characters, divided into 90M, 5M, and 5M for train, dev, and test, following [48]. |
| Hardware Specification | Yes | Each experiment takes around 8 days to finish on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'PyTorch for implementation and count the FLOPs using fvcore [45]' but does not specify version numbers for these software components. |
| Experiment Setup | Yes | We follow a similar training schedule to Longformer and train our model in 3 phases with increasing sequence lengths. The input sequence lengths are 2048, 4096, and 8192 for the 3 phases, respectively. Detailed hyperparameters are shown in Appendix D. (See the phase-schedule sketch after the table.) |
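
For readers skimming the table, here is a minimal PyTorch sketch of the two-branch attention the abstract alludes to: each query attends to a local span of keys/values plus a small set of "landmark" tokens obtained by a data-dependent low-rank projection of the whole sequence. The class name, the block-local short branch (the paper uses a sliding window), and all sizes (`d_model`, `segment`, `rank`) are illustrative assumptions, not the authors' released implementation at https://github.com/NVIDIA/transformer-ls.

```python
# Toy long-short style attention: concatenate (i) keys/values from the query's
# local segment and (ii) a rank-r dynamic projection of the whole sequence,
# then run ordinary softmax attention over that short joint list.
import torch
from torch import nn


class LongShortAttentionSketch(nn.Module):
    def __init__(self, d_model: int = 64, segment: int = 32, rank: int = 8):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.dyn_proj = nn.Linear(d_model, rank, bias=False)  # projection scores
        self.out = nn.Linear(d_model, d_model)
        self.segment, self.scale = segment, d_model ** -0.5

    def forward(self, x):                         # x: (B, n, d), n % segment == 0
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Long branch: mix the n positions down to `rank` landmark tokens with
        # data-dependent weights (the "dynamic projection" idea).
        p = torch.softmax(self.dyn_proj(x), dim=1)            # (B, n, r)
        k_long = torch.einsum("bnr,bnd->brd", p, k)           # (B, r, d)
        v_long = torch.einsum("bnr,bnd->brd", p, v)

        # Short branch: block-local keys/values, a simplification of the
        # paper's sliding-window attention, grouped into segments.
        s = self.segment
        q_seg = q.view(B, n // s, s, d)
        k_short = k.view(B, n // s, s, d)
        v_short = v.view(B, n // s, s, d)

        # Joint attention: each query sees its segment plus the r landmarks.
        k_joint = torch.cat([k_short, k_long.unsqueeze(1).expand(B, n // s, -1, d)], dim=2)
        v_joint = torch.cat([v_short, v_long.unsqueeze(1).expand(B, n // s, -1, d)], dim=2)
        attn = torch.softmax(q_seg @ k_joint.transpose(-1, -2) * self.scale, dim=-1)
        return self.out((attn @ v_joint).reshape(B, n, d))


if __name__ == "__main__":
    layer = LongShortAttentionSketch()
    print(layer(torch.randn(2, 128, 64)).shape)   # torch.Size([2, 128, 64])
```

Note that the released model also normalizes the two branches before mixing them and uses a causal sliding window for language modeling; those details are omitted here for brevity.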
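
The Open Datasets and Experiment Setup rows describe the character-level pipeline: 100M-character enwik8/text8 split 90M/5M/5M, then three training phases with sequence lengths 2048, 4096, and 8192. The sketch below illustrates that pipeline under stated assumptions; the file path, byte-level loading, and non-overlapping chunking are guesses, not the released training script.

```python
# Sketch: 90M/5M/5M character split and the three-phase sequence-length schedule.
import numpy as np

PHASE_SEQ_LENS = [2048, 4096, 8192]          # per-phase input sequence lengths


def load_splits(path="enwik8", train=90_000_000, dev=5_000_000):
    data = np.fromfile(path, dtype=np.uint8)  # raw characters as bytes (assumption)
    assert data.size >= train + 2 * dev, "expected ~100M characters"
    return (data[:train],
            data[train:train + dev],
            data[train + dev:train + 2 * dev])


def iter_phase(split, seq_len):
    """Yield non-overlapping training sequences of the given phase length."""
    for start in range(0, split.size - seq_len, seq_len):
        yield split[start:start + seq_len]


if __name__ == "__main__":
    train, dev, test = load_splits()
    for phase, seq_len in enumerate(PHASE_SEQ_LENS, start=1):
        first = next(iter_phase(train, seq_len))
        print(f"phase {phase}: sequence length {first.size}")
```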
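
The test protocol quoted above (overlapping windows advanced by 512 characters, scoring only the final 512 tokens of each window given the preceding 32K context) can be sketched as follows. `log2_probs_fn` is a hypothetical stand-in for a trained model's summed log2-probabilities over the scored tokens; it is not part of the transformer-ls codebase, and the exact window/context boundary in the released evaluation code may differ slightly.

```python
# Sketch of overlapping-window BPC evaluation: 32K-character context, 512 scored
# tokens per window, windows advanced by 512.
import numpy as np

CONTEXT, STEP = 32_768, 512


def evaluate_bpc(test_split: np.ndarray, log2_probs_fn) -> float:
    """Average bits-per-character over the scored tokens of all windows."""
    total_log2, total_scored = 0.0, 0
    for end in range(CONTEXT + STEP, test_split.size + 1, STEP):
        window = test_split[end - CONTEXT - STEP:end]   # 32K context + 512 targets
        # Assumed interface: summed log2-probability of the last `n_scored`
        # tokens of `window`, conditioned on everything before them.
        total_log2 += log2_probs_fn(window, n_scored=STEP)
        total_scored += STEP
    return -total_log2 / total_scored                   # bits per character
```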