O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks. |
| Researcher Affiliation | Collaboration | Chulhee Yun (MIT, chulheey@mit.edu); Yin-Wen Chang (Google Research NY, yinwen@google.com) |
| Pseudocode | No | The paper does not include pseudocode blocks or algorithm sections. |
| Open Source Code | No | The paper does not contain any statements about releasing code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We conduct the language modeling experiments on the One Billion Word Benchmark [5]... For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We experiment with the BERT-Base model and report results on two sentence-pair classification tasks: MNLI [30] (Figure 2a) and XNLI [7] (Figure 2b). |
| Dataset Splits | Yes | For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We plot the average accuracy of three runs on the dev set against the sparsity level. |
| Hardware Specification | No | The paper does not specify any hardware details like GPU or CPU models used for running the experiments. |
| Software Dependencies | No | For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERT-Base model. No version numbers for Tensor2Tensor or BERT-Base are provided. |
| Experiment Setup | Yes | We use maximum sequence length 256 in all our experiments, except 128 for GLUE tasks. For the copying task, we experiment with only one sparse Transformer block (cf. Eq (2)), with varying numbers of attention layers with 4 attention heads. For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERT-Base model. (An illustrative sparse-attention sketch follows the table.) |
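To make the O(n)-connection setup above concrete, the following is a minimal sketch in plain NumPy, not the authors' Tensor2Tensor code: a single attention head whose queries are restricted to a small sliding window of keys, which is one of the sparsity patterns the paper's framework covers. The function names and the window size of 4 are illustrative assumptions; only the sequence length of 256 echoes the reported setup.

```python
# Minimal sketch (not the authors' implementation): a sliding-window sparsity
# mask limits each query to O(1) keys, so the whole layer uses O(n) connections
# instead of the dense O(n^2). Names and the window size are assumptions.
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 4) -> np.ndarray:
    """1 where query i may attend to key j (|i - j| <= window), else 0."""
    idx = np.arange(seq_len)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)

def sparse_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with masked positions pushed to -inf."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask > 0, scores, -1e9)        # drop masked connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

# Toy usage: sequence length 256 as in the paper's setup, random features.
seq_len, d_model = 256, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model)).astype(np.float32)
out = sparse_attention(x, x, x, sliding_window_mask(seq_len))
print(out.shape)  # (256, 64)
```

With a fixed window, each of the n queries attends to at most 2·window + 1 keys, so the total number of attention connections grows as O(n) rather than O(n²), which is the sparsity regime whose expressive power the paper analyzes.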