Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
Authors: Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the often glossed-over usage of different vocabulary sizes or embedding ranks in different domains. As an additional benefit, our rank bottlenecking framework allows us to identify size redundancies of 25%-50% in leading NLP models such as ALBERT and T5. |
| Researcher Affiliation | Academia | 1The Hebrew University of Jerusalem. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Our training set was English Wikipedia, Book Corpus and Open Web Text, with a total size of 60G. |
| Dataset Splits | No | The paper mentions a training set and a test set, but does not report split percentages or sample counts, and it does not explicitly mention a validation set. |
| Hardware Specification | No | The paper states "Experiments were performed with Cloud TPUs" but does not specify the version or type of TPUs or any other hardware details. |
| Software Dependencies | No | The paper does not list its software dependencies or their version numbers. |
| Experiment Setup | Yes | We trained decoder-only language models, by optimizing the autoregressive log-likelihood of the training examples for 1M steps. The remainder of the training details are given in the appendix. |
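
The Experiment Setup row describes the core training procedure: decoder-only language models optimized on the autoregressive log-likelihood of the training examples for 1M steps. The sketch below illustrates that objective in PyTorch; the model dimensions, vocabulary size, optimizer, and learning rate are placeholders, not the paper's hyperparameters (those are given in the paper's appendix).

```python
# Illustrative sketch only: a decoder-only LM trained on the autoregressive
# log-likelihood, as described in the Experiment Setup row. All hyperparameters
# below (vocab size, width, depth, optimizer, learning rate) are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=32768, d_model=512, n_heads=8,
                 n_layers=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        # Additive causal mask: -inf above the diagonal blocks attention to future tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=tokens.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.lm_head(h)                       # (batch, seq_len, vocab_size)

model = DecoderOnlyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch):
    """One optimization step on a batch of token ids, shape (batch, seq_len)."""
    logits = model(batch[:, :-1])
    # Autoregressive log-likelihood: each position predicts the next token.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The paper runs this objective for 1M steps on a ~60GB corpus (English
# Wikipedia, Book Corpus, Open Web Text); data loading and TPU setup are omitted.
```

An `nn.TransformerEncoder` with a causal mask is used here as a stand-in for a decoder-only stack, since `nn.TransformerDecoder` expects encoder memory; this is a convenience of the sketch, not a claim about the authors' implementation.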