Combiner: Full Attention Transformer with Sparse Computation Cost
Authors: Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks. We validate Combiner on both autoregressive and bidirectional sequence modeling tasks over a variety of domains including text and images. |
| Researcher Affiliation | Collaboration | Stanford University ({hyren,jure}@cs.stanford.edu); Google Research, Brain Team ({hadai,zihangd,sherryy,schuurmans,bodai}@google.com); University of Alberta |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The implementation of Combiner can be found at https://github.com/google-research/google-research/tree/master/combiner. |
| Open Datasets | Yes | We evaluate Combiner with different full attention patterns on both autoregressive and bidirectional sequence modeling tasks, covering a wide range of input data from images to texts. We show that Combiner can achieve better perplexity and accuracy when using the same transformer architectures while being much faster in terms of runtime, and achieves state-of-the-art performance on density estimation on standard datasets CIFAR-10 (2.77 bits/dim) and ImageNet-64 (3.42 bits/dim), as well as the Long-Range Arena [31]. For language modeling, we focus on the Wiki-40B-En dataset [34]... Specifically, we use the large scale C4 dataset [8] for training and evaluation... |
| Dataset Splits | Yes | We first perform a sanity check where we compare sparse attention baselines against Combiner with full attention under the same architecture on the CIFAR-10 dataset. The sequence length is 3072. We also evaluate performance under the autoregressive setting on ImageNet-64, where sequence length is 12,288. For language modeling, we focus on the Wiki-40B-En dataset [34]... Specifically, we use the large scale C4 dataset [8] for training and evaluation... |
| Hardware Specification | Yes | We run inference of all the models on a TPU v3-16 (16 cores x 16GB) with batch size 16... |
| Software Dependencies | No | The paper mentions that Combiner "can be easily implemented in common frameworks" and "GPU/TPU friendly", but it does not specify any software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | We train all models for 500k iterations using batch size 32 on TPU v2. For all the methods, we use the same 6-layer transformer with 8 attention heads and 512 embedding dimensions. Following the 128-layer architecture in Child et al. [14], we apply Combiner-Axial and achieve state-of-the-art performance, 2.77 BPD on CIFAR-10. (See the illustrative configuration sketch below the table.) |
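
To make the reported experiment setup concrete, the following is a minimal sketch in plain Python. The `CombinerExperimentConfig` dataclass and its field names are illustrative assumptions of ours, not taken from the Combiner codebase; the numeric values are the ones quoted in the table rows above.

```python
# Illustrative sketch only: the dataclass and field names are assumptions,
# not identifiers from the Combiner repository. Values come from the
# Experiment Setup, Hardware, and Dataset Splits rows quoted above.
from dataclasses import dataclass


@dataclass
class CombinerExperimentConfig:
    # Shared transformer architecture used for all compared methods.
    num_layers: int = 6            # "6-layer transformer"
    num_heads: int = 8             # "8 attention heads"
    embed_dim: int = 512           # "512 embedding dimensions"
    # Training schedule.
    train_steps: int = 500_000     # "500k iterations"
    train_batch_size: int = 32     # "batch size 32 on TPU v2"
    # Inference setting from the Hardware Specification row.
    eval_batch_size: int = 16      # "TPU v3-16 ... with batch size 16"
    # Sequence lengths from the Dataset Splits row.
    seq_len_cifar10: int = 3_072       # CIFAR-10: 32 x 32 x 3
    seq_len_imagenet64: int = 12_288   # ImageNet-64: 64 x 64 x 3


if __name__ == "__main__":
    config = CombinerExperimentConfig()
    print(config)
```
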