O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks."
Researcher Affiliation | Collaboration | Chulhee Yun (MIT, chulheey@mit.edu); Yin-Wen Chang (Google Research NY, yinwen@google.com).
Pseudocode | No | The paper does not include pseudocode blocks or algorithm sections.
Open Source Code | No | The paper contains no statement about releasing code and no link to a code repository for the described methodology.
Open Datasets | Yes | "We conduct the language modeling experiments on the One Billion Word Benchmark [5]... For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We experiment with the BERTBASE model and report results on two sentence-pair classification tasks: MNLI [30] (Figure 2a) and XNLI [7] (Figure 2b)." (See the dataset-loading sketch below the table.)
Dataset Splits | Yes | "For the translation task, we train the model on WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset... We plot the average accuracy of three runs on the dev set against the sparsity level."
Hardware Specification | No | The paper does not specify hardware details such as the GPU or CPU models used to run the experiments.
Software Dependencies | No | "For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERTBASE model." No version numbers for Tensor2Tensor or BERTBASE are provided.
Experiment Setup | Yes | "We use maximum sequence length 256 in all our experiments, except 128 for GLUE tasks. For the copying task, we experiment with only one sparse Transformer block (cf. Eq (2)), with varying numbers of attention layers with 4 attention heads. For language modeling and translation, we use the Tensor2Tensor [29] framework and employ 12-block and 6-block (respectively) Transformers with 8 attention heads per block. For GLUE tasks, we experiment with the BERTBASE model." (See the configuration sketch below the table.)
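Since the paper releases no code, the following is only a minimal sketch of how the quoted corpora could be obtained today, assuming the public TensorFlow Datasets catalog. The identifiers `lm1b`, `wmt18_translate/cs-en`, `multi_nli`, and `xnli` and the split names are our own mapping to the datasets named in the paper, not artifacts from the authors.

```python
# Illustrative dataset access via TensorFlow Datasets; not the authors' pipeline.
import tensorflow_datasets as tfds

# One Billion Word Benchmark, used for the language modeling experiments.
lm_train = tfds.load("lm1b", split="train")

# WMT18 English-Czech for translation (the paper tests on Newstest 2015).
mt_train = tfds.load("wmt18_translate/cs-en", split="train")

# Sentence-pair classification tasks evaluated with the BERT-Base model.
mnli_dev = tfds.load("multi_nli", split="validation_matched")
xnli_dev = tfds.load("xnli", split="validation")
```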
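For readability, the reported hyperparameters can be collected into plain configuration dicts. This is a sketch only: the field names are ours, and only the numeric values are taken from the quoted setup; the BERT-Base depth and head count are the standard published values, not stated in the paper.

```python
# Hyperparameters per experiment, as quoted in the Experiment Setup row.
copying_task = {
    "num_blocks": 1,        # one sparse Transformer block (cf. Eq (2)), varying attention layers
    "num_heads": 4,
    "max_seq_length": 256,
}

language_modeling = {       # Tensor2Tensor, One Billion Word Benchmark
    "num_blocks": 12,
    "num_heads": 8,
    "max_seq_length": 256,
}

translation = {             # Tensor2Tensor, WMT18 en-cs
    "num_blocks": 6,
    "num_heads": 8,
    "max_seq_length": 256,
}

glue = {                    # BERT-Base on MNLI and XNLI
    "num_blocks": 12,       # assumption: standard BERT-Base (12 layers, 12 heads)
    "num_heads": 12,
    "max_seq_length": 128,
}
```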