Fast Transformers with Clustered Attention
Authors: Apoorv Vyas, Angelos Katharopoulos, François Fleuret
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance. (A hedged sketch of the clustered-attention idea follows the table.) |
| Researcher Affiliation | Academia | 1 Idiap Research Institute, Switzerland; 2 École Polytechnique Fédérale de Lausanne, Switzerland; 3 University of Geneva, Switzerland |
| Pseudocode | No | The paper contains mathematical formulations but no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our PyTorch code can be found at https://clustered-transformers.github.io. |
| Open Datasets | Yes | We employ the Wall-Street Journal dataset [21]; We also evaluate our model on the Switchboard dataset [11]; GLUE [28] and SQuAD [23] benchmarks. |
| Dataset Splits | No | The paper mentions using a 'validation set' for evaluation (e.g., 'achieved PER on the validation set' for WSJ and 'achieved word error rate (WER) in the validation set' for Switchboard), but it does not specify the exact split percentages or sample counts for this set. |
| Hardware Specification | Yes | All experiments are conducted using NVidia GTX 1080 Ti with 11GB of memory and all models are implemented in PyTorch [20]. |
| Software Dependencies | No | The paper states that 'all models are implemented in PyTorch [20]' but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | The input to all transformers is 40-dimensional filter-bank features with fixed positional embeddings. We train using Connectionist Temporal Classification (CTC) [12] loss with phonemes as ground-truth labels. … We train *full* with 4, 6 and 9 layers to get a range of the required computation time and achieved phone error rate (PER). Similarly, we train *i-clustered* with 6 and 9 layers. Both models are trained with 100 and 200 clusters. (A hedged sketch of this setup follows the table.) |
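
To make the quoted claim about approximating attention with few clusters concrete, here is a minimal sketch of the clustered-attention idea: queries are grouped (here with a toy k-means), attention is computed once per cluster centroid, and the result is broadcast back to the member queries. This is an illustration only, not the authors' released implementation (their PyTorch code is linked above); the k-means routine, tensor shapes, and the choice of 25 clusters are illustrative assumptions.

```python
# Illustrative sketch of clustered attention, NOT the authors' implementation
# (their PyTorch code is at https://clustered-transformers.github.io).
import torch


def kmeans(x, n_clusters, n_iters=10):
    """Toy k-means over the rows of x; returns assignments and centroids."""
    centroids = x[torch.randperm(x.shape[0])[:n_clusters]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)   # (N,)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    return assign, centroids


def clustered_attention(Q, K, V, n_clusters=25):
    """Approximate softmax attention by attending once per query cluster.

    Q: (N, E), K: (M, E), V: (M, D). Every query in a cluster receives the
    attention output computed for its cluster centroid.
    """
    assign, centroids = kmeans(Q, n_clusters)
    scores = centroids @ K.t() / K.shape[-1] ** 0.5        # (C, M)
    centroid_out = torch.softmax(scores, dim=-1) @ V       # (C, D)
    return centroid_out[assign]                            # broadcast to queries


if __name__ == "__main__":
    Q, K, V = torch.randn(512, 64), torch.randn(512, 64), torch.randn(512, 64)
    print(clustered_attention(Q, K, V, n_clusters=25).shape)  # (512, 64)
```

With 25 centroids the softmax over the keys is evaluated 25 times rather than once per query, which is where the computational saving over vanilla attention comes from.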
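
The experiment-setup row describes 40-dimensional filter-bank inputs, fixed positional embeddings, and CTC training over phoneme labels. The sketch below shows one way such a setup could look in PyTorch, using a vanilla `nn.TransformerEncoder` as a stand-in for the clustered model; the hidden size, head count, phoneme vocabulary size, and batch shapes are assumptions, not values from the paper.

```python
# Hedged sketch of the ASR setup described in the row: 40-dim filter-bank
# inputs, fixed sinusoidal positional embeddings, CTC loss over phonemes.
# A vanilla TransformerEncoder stands in for the clustered model; all sizes
# below are illustrative assumptions.
import math
import torch
import torch.nn as nn


class CTCTransformer(nn.Module):
    def __init__(self, n_feats=40, d_model=256, n_heads=4, n_layers=6, n_phonemes=42):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_phonemes + 1)   # +1 for the CTC blank
        # Fixed (non-learned) sinusoidal positional embeddings.
        pos = torch.arange(4096).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(4096, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, feats):                  # feats: (B, T, 40)
        x = self.proj(feats) + self.pe[: feats.shape[1]]
        x = self.encoder(x)
        return self.out(x).log_softmax(-1)     # (B, T, n_phonemes + 1)


model = CTCTransformer()
ctc = nn.CTCLoss(blank=model.out.out_features - 1)
feats = torch.randn(8, 200, 40)                # batch of filter-bank features
targets = torch.randint(0, 42, (8, 30))        # phoneme label sequences
log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (T, B, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 200),
           target_lengths=torch.full((8,), 30))
loss.backward()
```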