Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
Authors: Sungjun Cho, Seonwoo Min, Jinwoo Kim, Moontae Lee, Honglak Lee, Seunghoon Hong
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. For empirical evaluations, we first use a synthetic task to show that our model is flexible enough to learn towards full attention when needed in contrast to previous works. We then experiment on Long Range Arena (LRA) [36], a benchmark widely used to assess the capacity of efficient Transformers in learning long-range contexts across different modalities. Lastly, we show results on the GLUE benchmark [39] to assess the performance of SBM-Transformer in a downstream NLP setting. |
| Researcher Affiliation | Collaboration | ¹LG AI Research, ²KAIST, ³University of Illinois Chicago |
| Pseudocode | Yes | Algorithm 1: fastRG(Y, B, Z) [33] (a sketch of fastRG-style mask sampling is given after the table) |
| Open Source Code | Yes | Our implementation can be found in https://github.com/sc782/SBM-Transformer. |
| Open Datasets | Yes | For empirical evaluations, we first use a synthetic task... We then experiment on Long Range Arena (LRA) [36]... Lastly, we show results on the GLUE benchmark [39]... LRA [36] consists of five different testbeds with varying modalities: LISTOPS [26]... TEXT [24]... RETRIEVAL [30]... IMAGE [21]... PATHFINDER [23]... |
| Dataset Splits | Yes | For this benchmark, we use the PyTorch implementation of LRA provided by the authors of Nyströmformer [43] and adhere to the same train-test splits. We then finetune each pretrained model for 5 epochs on the GLUE training sets... F1 score on the respective validation sets. |
| Hardware Specification | Yes | All experiments were run on a remote GCP server equipped with 16 NVIDIA A100 Tensor Core GPUs. |
| Software Dependencies | No | For this benchmark, we use the PyTorch implementation of LRA provided by the authors of Nyströmformer [43]... No specific version numbers for software are given. |
| Experiment Setup | Yes | Across all methods, we use a single-layer and single-head architecture with 32 hidden dimensions. All models are trained for 2000 epochs where a new batch of sequences is sampled on-the-fly at each epoch. We use a batch size of 256 and learning rate of 1e-3. For fair comparison, we set all Transformer models to use the default setting used in [43], which fixes 2 layers, 2 attention heads, and 64 embedding dimensions. Following previous work [43], we arrange a small variant of BERT [13] with 4 layers, 8 attention heads, and 512 embedding dimensions. We first pretrain each model under the masked language modeling objective for 50 epochs on a corpus with text from English Wikipedia, BookCorpus [50], and RealNews [47]. We then finetune each pretrained model for 5 epochs on the GLUE training sets. (A configuration sketch collecting these hyperparameters is given after the table.) |
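The pseudocode row refers to fastRG [33], which the paper uses to sample a sparse attention mask whose expectation is governed by the SBM factors Y, B, and Z. Below is a minimal NumPy sketch of fastRG-style sampling for illustration only; the function name `fastrg_sample_mask` and its exact signature are assumptions, not the paper's released implementation (see the repository linked above for the authors' code).

```python
import numpy as np

def fastrg_sample_mask(Y, B, Z, rng=None):
    """Sketch of fastRG-style sampling of a sparse bipartite attention mask.

    Y: (n, k) nonnegative query-to-cluster memberships
    B: (k, k) nonnegative inter-cluster affinities
    Z: (m, k) nonnegative key-to-cluster memberships
    Returns a boolean (n, m) mask whose expected edge counts follow Y @ B @ Z.T.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, k = Y.shape
    m = Z.shape[0]

    # Column sums turn memberships into per-cluster sampling distributions.
    cy = Y.sum(axis=0)                      # (k,)
    cz = Z.sum(axis=0)                      # (k,)
    Yt = Y / np.maximum(cy, 1e-12)          # each column sums to 1
    Zt = Z / np.maximum(cz, 1e-12)
    Bt = cy[:, None] * B * cz[None, :]      # expected edge mass per cluster pair

    # The number of edges is Poisson with mean equal to the total expected mass,
    # so sampling cost scales with the number of edges, not with n * m.
    total = Bt.sum()
    num_edges = rng.poisson(total)

    mask = np.zeros((n, m), dtype=bool)
    if total == 0 or num_edges == 0:
        return mask

    # Sample a cluster pair for each edge, then an endpoint within each cluster.
    pair_probs = (Bt / total).ravel()
    pairs = rng.choice(k * k, size=num_edges, p=pair_probs)
    for p in pairs:
        u, v = divmod(p, k)
        i = rng.choice(n, p=Yt[:, u])       # query endpoint
        j = rng.choice(m, p=Zt[:, v])       # key endpoint
        mask[i, j] = True
    return mask
```

The property this sketch is meant to illustrate is the one the paper relies on: runtime is proportional to the number of sampled edges rather than to the size of the full attention matrix, which is what makes the attention cost data-adaptive.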
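For reference, the hyperparameters quoted in the experiment-setup row can be grouped per experiment as below. This is only an organizational sketch; the dictionary names (`synthetic_cfg`, `lra_cfg`, `glue_cfg`) are hypothetical and not taken from the SBM-Transformer codebase.

```python
# Hyperparameters as quoted in the reproducibility table above.
synthetic_cfg = dict(          # synthetic task
    num_layers=1, num_heads=1, hidden_dim=32,
    epochs=2000, batch_size=256, learning_rate=1e-3,
)
lra_cfg = dict(                # LRA, default setting from the Nyströmformer codebase [43]
    num_layers=2, num_heads=2, embed_dim=64,
)
glue_cfg = dict(               # GLUE, small BERT variant following [43]
    num_layers=4, num_heads=8, embed_dim=512,
    pretrain_epochs=50, finetune_epochs=5,
)
```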