O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks."
Researcher Affiliation | Collaboration | Chulhee Yun (MIT, chulheey@mit.edu); Yin-Wen Chang (Google Research NY, yinwen@google.com).
Pseudocode | No | The paper does not include pseudocode blocks or algorithm sections.
Open Source Code | No | The paper contains no statement about releasing code and no link to a code repository for the described methodology.
Open Datasets | Yes | "We conduct the language modeling experiments on the One Billion Word Benchmark [5]... For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We experiment with the BERTBASE model and report results on two sentence-pair classification tasks: MNLI [30] (Figure 2a) and XNLI [7] (Figure 2b)." (See the dataset-loading sketch below the table.)
Dataset Splits | Yes | "For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We plot the average accuracy of three runs on the dev set against the sparsity level."
Hardware Specification | No | The paper does not specify hardware details such as the GPU or CPU models used to run the experiments.
Software Dependencies | No | "For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERTBASE model." No version numbers for Tensor2Tensor or BERTBASE are provided.
Experiment Setup | Yes | "We use maximum sequence length 256 in all our experiments, except 128 for GLUE tasks. For the copying task, we experiment with only one sparse Transformer block (cf. Eq (2)), with varying numbers of attention layers with 4 attention heads. For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERTBASE model." (See the configuration sketch below the table.)
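Since the paper releases no code, the following is only a minimal sketch of how the quoted corpora could be obtained today, assuming the public TensorFlow Datasets catalog. The identifiers `lm1b`, `wmt18_translate/cs-en`, `multi_nli`, and `xnli` and the split names are our own mapping to the datasets named in the paper, not artifacts from the authors.

```python
# Illustrative dataset access via TensorFlow Datasets; not the authors' pipeline.
import tensorflow_datasets as tfds

# One Billion Word Benchmark, used for the language modeling experiments.
lm_train = tfds.load("lm1b", split="train")

# WMT18 English-Czech for translation (the paper tests on Newstest 2015).
mt_train = tfds.load("wmt18_translate/cs-en", split="train")

# Sentence-pair classification tasks evaluated with the BERT-Base model.
mnli_dev = tfds.load("multi_nli", split="validation_matched")
xnli_dev = tfds.load("xnli", split="validation")
```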
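For readability, the reported hyperparameters can be collected into plain configuration dicts. This is a sketch only: the field names are ours, and only the numeric values are taken from the quoted setup; the BERT-Base depth and head count are the standard published values, not stated in the paper.

```python
# Hyperparameters per experiment, as quoted in the Experiment Setup row.
copying_task = {
    "num_blocks": 1,        # one sparse Transformer block (cf. Eq (2)), varying attention layers
    "num_heads": 4,
    "max_seq_length": 256,
}

language_modeling = {       # Tensor2Tensor, One Billion Word Benchmark
    "num_blocks": 12,
    "num_heads": 8,
    "max_seq_length": 256,
}

translation = {             # Tensor2Tensor, WMT18 en-cs
    "num_blocks": 6,
    "num_heads": 8,
    "max_seq_length": 256,
}

glue = {                    # BERT-Base on MNLI and XNLI
    "num_blocks": 12,       # assumption: standard BERT-Base (12 layers, 12 heads)
    "num_heads": 12,
    "max_seq_length": 128,
}
```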