Fast Transformers with Clustered Attention
Authors: Apoorv Vyas, Angelos Katharopoulos, François Fleuret
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance. (A hedged sketch of the clustered-attention idea follows the table.) |
| Researcher Affiliation | Academia | 1 Idiap Research Institute, Switzerland; 2 École Polytechnique Fédérale de Lausanne, Switzerland; 3 University of Geneva, Switzerland |
| Pseudocode | No | The paper contains mathematical formulations but no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our PyTorch code can be found at https://clustered-transformers.github.io. |
| Open Datasets | Yes | We employ the Wall-Street Journal dataset [21]; We also evaluate our model on the Switchboard dataset [11]; GLUE [28] and SQuAD [23] benchmarks. |
| Dataset Splits | No | The paper mentions using a 'validation set' for evaluation (e.g., 'achieved PER on the validation set' for WSJ and 'achieved word error rate (WER) in the validation set' for Switchboard), but it does not specify the exact split percentages or sample counts for this set. |
| Hardware Specification | Yes | All experiments are conducted using NVidia GTX 1080 Ti with 11GB of memory and all models are implemented in PyTorch [20]. |
| Software Dependencies | No | The paper states that 'all models are implemented in PyTorch [20]' but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | The input to all transformers is 40-dimensional filter-bank features with fixed positional embeddings. We train using Connectionist Temporal Classification (CTC) [12] loss with phonemes as ground-truth labels. … We train *full* with 4, 6 and 9 layers to get a range of the required computation time and achieved phone error rate (PER). Similarly, we train *i-clustered* with 6 and 9 layers. Both models are trained with 100 and 200 clusters. (A hedged sketch of this setup follows the table.) |
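
To make the quoted claim about approximating attention with few clusters concrete, here is a minimal sketch of the clustered-attention idea: queries are grouped (here with a toy k-means), attention is computed once per cluster centroid, and the result is broadcast back to the member queries. This is an illustration only, not the authors' released implementation (their PyTorch code is linked above); the k-means routine, tensor shapes, and the choice of 25 clusters are illustrative assumptions.

```python
# Illustrative sketch of clustered attention, NOT the authors' implementation
# (their PyTorch code is at https://clustered-transformers.github.io).
import torch


def kmeans(x, n_clusters, n_iters=10):
    """Toy k-means over the rows of x; returns assignments and centroids."""
    centroids = x[torch.randperm(x.shape[0])[:n_clusters]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)   # (N,)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    return assign, centroids


def clustered_attention(Q, K, V, n_clusters=25):
    """Approximate softmax attention by attending once per query cluster.

    Q: (N, E), K: (M, E), V: (M, D). Every query in a cluster receives the
    attention output computed for its cluster centroid.
    """
    assign, centroids = kmeans(Q, n_clusters)
    scores = centroids @ K.t() / K.shape[-1] ** 0.5        # (C, M)
    centroid_out = torch.softmax(scores, dim=-1) @ V       # (C, D)
    return centroid_out[assign]                            # broadcast to queries


if __name__ == "__main__":
    Q, K, V = torch.randn(512, 64), torch.randn(512, 64), torch.randn(512, 64)
    print(clustered_attention(Q, K, V, n_clusters=25).shape)  # (512, 64)
```

With 25 centroids the softmax over the keys is evaluated 25 times rather than once per query, which is where the computational saving over vanilla attention comes from.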
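
The experiment-setup row describes 40-dimensional filter-bank inputs, fixed positional embeddings, and CTC training over phoneme labels. The sketch below shows one way such a setup could look in PyTorch, using a vanilla `nn.TransformerEncoder` as a stand-in for the clustered model; the hidden size, head count, phoneme vocabulary size, and batch shapes are assumptions, not values from the paper.

```python
# Hedged sketch of the ASR setup described in the row: 40-dim filter-bank
# inputs, fixed sinusoidal positional embeddings, CTC loss over phonemes.
# A vanilla TransformerEncoder stands in for the clustered model; all sizes
# below are illustrative assumptions.
import math
import torch
import torch.nn as nn


class CTCTransformer(nn.Module):
    def __init__(self, n_feats=40, d_model=256, n_heads=4, n_layers=6, n_phonemes=42):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_phonemes + 1)   # +1 for the CTC blank
        # Fixed (non-learned) sinusoidal positional embeddings.
        pos = torch.arange(4096).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(4096, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, feats):                  # feats: (B, T, 40)
        x = self.proj(feats) + self.pe[: feats.shape[1]]
        x = self.encoder(x)
        return self.out(x).log_softmax(-1)     # (B, T, n_phonemes + 1)


model = CTCTransformer()
ctc = nn.CTCLoss(blank=model.out.out_features - 1)
feats = torch.randn(8, 200, 40)                # batch of filter-bank features
targets = torch.randint(0, 42, (8, 30))        # phoneme label sequences
log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (T, B, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 200),
           target_lengths=torch.full((8,), 30))
loss.backward()
```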