Long-Short Transformer: Efficient Transformers for Language and Vision
Authors: Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of previous methods, while being faster and able to handle 3× longer sequences than its full-attention version on the same hardware. On ImageNet, it can obtain state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224×224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. (A minimal sketch of the long-short attention idea appears after the table.) |
| Researcher Affiliation | Collaboration | Chen Zhu¹, Wei Ping², Chaowei Xiao²,³, Mohammad Shoeybi², Tom Goldstein¹, Anima Anandkumar²,⁴, and Bryan Catanzaro². ¹University of Maryland, College Park; ²NVIDIA; ³Arizona State University; ⁴California Institute of Technology |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and models are released at https://github.com/NVIDIA/transformer-ls. |
| Open Datasets | Yes | We train and evaluate our model on enwik8 and text8, each with 100M characters, divided into 90M, 5M, and 5M for train, dev, and test, following [48]. Our smaller 12-layer and larger 30-layer models are Pre-LN Transformers with the same width and depth as Longformer [20], except that we add relative position encoding to the projected segments in each layer. We adopt the cache mechanism of Transformer-XL [9], setting the cache size to be the same as the input sequence length. We follow a similar training schedule to Longformer and train our model in 3 phases with increasing sequence lengths. The input sequence lengths are 2048, 4096, and 8192 for the 3 phases, respectively. By comparison, Longformer trains their model in 5 phases on GPUs with 48GB memory (the maximum of ours is 32GB), where the sequence length is 23,040 in the last phase. The window size of Longformer increases with depth and its average window size is 4352 in phase 5, while our effective number of attended tokens is 1280 on average in the last phase. Each experiment takes around 8 days to finish on 8 V100 GPUs. Detailed hyperparameters are shown in Appendix D. For testing, same as Longformer, we split the dataset into overlapping sequences of length 32K at a step size of 512, and evaluate the BPC for predicting the next 512 tokens given the previous 32K characters. (Sketches of this data pipeline and evaluation protocol appear after the table.) |
| Dataset Splits | Yes | We train and evaluate our model on enwik8 and text8, each with 100M characters, divided into 90M, 5M, and 5M for train, dev, and test, following [48]. |
| Hardware Specification | Yes | Each experiment takes around 8 days to finish on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'PyTorch for implementation and count the FLOPs using fvcore [45]' but does not specify version numbers for these software components. |
| Experiment Setup | Yes | We follow a similar training schedule to Longformer and train our model in 3 phases with increasing sequence lengths. The input sequence lengths are 2048, 4096, and 8192 for the 3 phases, respectively. Detailed hyperparameters are shown in Appendix D. (See the phase-schedule sketch after the table.) |
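
For readers skimming the table, here is a minimal PyTorch sketch of the two-branch attention the abstract alludes to: each query attends to a local span of keys/values plus a small set of "landmark" tokens obtained by a data-dependent low-rank projection of the whole sequence. The class name, the block-local short branch (the paper uses a sliding window), and all sizes (`d_model`, `segment`, `rank`) are illustrative assumptions, not the authors' released implementation at https://github.com/NVIDIA/transformer-ls.

```python
# Toy long-short style attention: concatenate (i) keys/values from the query's
# local segment and (ii) a rank-r dynamic projection of the whole sequence,
# then run ordinary softmax attention over that short joint list.
import torch
from torch import nn


class LongShortAttentionSketch(nn.Module):
    def __init__(self, d_model: int = 64, segment: int = 32, rank: int = 8):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.dyn_proj = nn.Linear(d_model, rank, bias=False)  # projection scores
        self.out = nn.Linear(d_model, d_model)
        self.segment, self.scale = segment, d_model ** -0.5

    def forward(self, x):                         # x: (B, n, d), n % segment == 0
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Long branch: mix the n positions down to `rank` landmark tokens with
        # data-dependent weights (the "dynamic projection" idea).
        p = torch.softmax(self.dyn_proj(x), dim=1)            # (B, n, r)
        k_long = torch.einsum("bnr,bnd->brd", p, k)           # (B, r, d)
        v_long = torch.einsum("bnr,bnd->brd", p, v)

        # Short branch: block-local keys/values, a simplification of the
        # paper's sliding-window attention, grouped into segments.
        s = self.segment
        q_seg = q.view(B, n // s, s, d)
        k_short = k.view(B, n // s, s, d)
        v_short = v.view(B, n // s, s, d)

        # Joint attention: each query sees its segment plus the r landmarks.
        k_joint = torch.cat([k_short, k_long.unsqueeze(1).expand(B, n // s, -1, d)], dim=2)
        v_joint = torch.cat([v_short, v_long.unsqueeze(1).expand(B, n // s, -1, d)], dim=2)
        attn = torch.softmax(q_seg @ k_joint.transpose(-1, -2) * self.scale, dim=-1)
        return self.out((attn @ v_joint).reshape(B, n, d))


if __name__ == "__main__":
    layer = LongShortAttentionSketch()
    print(layer(torch.randn(2, 128, 64)).shape)   # torch.Size([2, 128, 64])
```

Note that the released model also normalizes the two branches before mixing them and uses a causal sliding window for language modeling; those details are omitted here for brevity.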
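
The Open Datasets and Experiment Setup rows describe the character-level pipeline: 100M-character enwik8/text8 split 90M/5M/5M, then three training phases with sequence lengths 2048, 4096, and 8192. The sketch below illustrates that pipeline under stated assumptions; the file path, byte-level loading, and non-overlapping chunking are guesses, not the released training script.

```python
# Sketch: 90M/5M/5M character split and the three-phase sequence-length schedule.
import numpy as np

PHASE_SEQ_LENS = [2048, 4096, 8192]          # per-phase input sequence lengths


def load_splits(path="enwik8", train=90_000_000, dev=5_000_000):
    data = np.fromfile(path, dtype=np.uint8)  # raw characters as bytes (assumption)
    assert data.size >= train + 2 * dev, "expected ~100M characters"
    return (data[:train],
            data[train:train + dev],
            data[train + dev:train + 2 * dev])


def iter_phase(split, seq_len):
    """Yield non-overlapping training sequences of the given phase length."""
    for start in range(0, split.size - seq_len, seq_len):
        yield split[start:start + seq_len]


if __name__ == "__main__":
    train, dev, test = load_splits()
    for phase, seq_len in enumerate(PHASE_SEQ_LENS, start=1):
        first = next(iter_phase(train, seq_len))
        print(f"phase {phase}: sequence length {first.size}")
```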
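
The test protocol quoted above (overlapping windows advanced by 512 characters, scoring only the final 512 tokens of each window given the preceding 32K context) can be sketched as follows. `log2_probs_fn` is a hypothetical stand-in for a trained model's summed log2-probabilities over the scored tokens; it is not part of the transformer-ls codebase, and the exact window/context boundary in the released evaluation code may differ slightly.

```python
# Sketch of overlapping-window BPC evaluation: 32K-character context, 512 scored
# tokens per window, windows advanced by 512.
import numpy as np

CONTEXT, STEP = 32_768, 512


def evaluate_bpc(test_split: np.ndarray, log2_probs_fn) -> float:
    """Average bits-per-character over the scored tokens of all windows."""
    total_log2, total_scored = 0.0, 0
    for end in range(CONTEXT + STEP, test_split.size + 1, STEP):
        window = test_split[end - CONTEXT - STEP:end]   # 32K context + 512 targets
        # Assumed interface: summed log2-probability of the last `n_scored`
        # tokens of `window`, conditioned on everything before them.
        total_log2 += log2_probs_fn(window, n_scored=STEP)
        total_scored += STEP
    return -total_log2 / total_scored                   # bits per character
```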