Combiner: Full Attention Transformer with Sparse Computation Cost
Authors: Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks. We validate Combiner on both autoregressive and bidirectional sequence modeling tasks over a variety of domains including text and images. |
| Researcher Affiliation | Collaboration | Stanford University ({hyren,jure}@cs.stanford.edu); Google Research, Brain Team ({hadai,zihangd,sherryy,schuurmans,bodai}@google.com); University of Alberta |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The implementation of Combiner can be found at https://github.com/google-research/google-research/tree/master/combiner. |
| Open Datasets | Yes | We evaluate Combiner with different full attention patterns on both autoregressive and bidirectional sequence modeling tasks, covering a wide range of input data from images to texts. We show that Combiner can achieve better perplexity and accuracy when using the same transformer architectures while being much faster in terms of runtime, and achieves state-of-the-art performance on density estimation on standard datasets CIFAR-10 (2.77 bits/dim) and ImageNet-64 (3.42 bits/dim), as well as the Long-Range Arena [31]. For language modeling, we focus on the Wiki-40B-En dataset [34]... Specifically, we use the large scale C4 dataset [8] for training and evaluation... |
| Dataset Splits | Yes | We first perform a sanity check where we compare sparse attention baselines against Combiner with full attention under the same architecture on the CIFAR-10 dataset. The sequence length is 3072. We also evaluate performance under the autoregressive setting on ImageNet-64, where sequence length is 12,288. For language modeling, we focus on the Wiki-40B-En dataset [34]... Specifically, we use the large scale C4 dataset [8] for training and evaluation... |
| Hardware Specification | Yes | We run inference of all the models on a TPU v3-16 (16 cores x 16GB) with batch size 16... |
| Software Dependencies | No | The paper mentions that Combiner "can be easily implemented in common frameworks" and "GPU/TPU friendly", but it does not specify any software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | We train all models for 500k iterations using batch size 32 on TPU v2. For all the methods, we use the same 6-layer transformer with 8 attention heads and 512 embedding dimensions. Following the 128-layer architecture in Child et al. [14], we apply Combiner-Axial and achieve state-of-the-art performance, 2.77 BPD on CIFAR-10. (See the illustrative configuration sketch below the table.) |
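
To make the reported experiment setup concrete, the following is a minimal sketch in plain Python. The `CombinerExperimentConfig` dataclass and its field names are illustrative assumptions of ours, not taken from the Combiner codebase; the numeric values are the ones quoted in the table rows above.

```python
# Illustrative sketch only: the dataclass and field names are assumptions,
# not identifiers from the Combiner repository. Values come from the
# Experiment Setup, Hardware, and Dataset Splits rows quoted above.
from dataclasses import dataclass


@dataclass
class CombinerExperimentConfig:
    # Shared transformer architecture used for all compared methods.
    num_layers: int = 6            # "6-layer transformer"
    num_heads: int = 8             # "8 attention heads"
    embed_dim: int = 512           # "512 embedding dimensions"
    # Training schedule.
    train_steps: int = 500_000     # "500k iterations"
    train_batch_size: int = 32     # "batch size 32 on TPU v2"
    # Inference setting from the Hardware Specification row.
    eval_batch_size: int = 16      # "TPU v3-16 ... with batch size 16"
    # Sequence lengths from the Dataset Splits row.
    seq_len_cifar10: int = 3_072       # CIFAR-10: 32 x 32 x 3
    seq_len_imagenet64: int = 12_288   # ImageNet-64: 64 x 64 x 3


if __name__ == "__main__":
    config = CombinerExperimentConfig()
    print(config)
```
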