Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

Authors: Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the often glossed-over usage of different vocabulary sizes or embedding ranks in different domains. As an additional benefit, our rank bottlenecking framework allows us to identify size redundancies of 25%-50% in leading NLP models such as ALBERT and T5. (A rank-bottleneck sketch follows the table.) |
| Researcher Affiliation | Academia | The Hebrew University of Jerusalem. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Our training set was English Wikipedia, Book Corpus and Open Web Text, with a total size of 60G. |
| Dataset Splits | No | The paper mentions a training set and a test set, but does not provide specific split percentages or sample counts for training, validation, and test splits from a single dataset, nor does it explicitly mention a validation set. |
| Hardware Specification | No | The paper states "Experiments were performed with Cloud TPUs" but does not specify the TPU version or type, nor any other hardware details. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We trained decoder-only language models, by optimizing the autoregressive log-likelihood of the training examples for 1M steps. The remainder of the training details are given in the appendix. (A hedged sketch of this objective also follows the table.) |
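The Research Type row refers to the paper's rank-bottleneck argument: when the token embedding has rank smaller than the model width, the representations feeding self-attention are confined to a low-dimensional subspace. The NumPy sketch below illustrates only this linear-algebra intuition; the vocabulary size, embedding rank, and projection matrices are illustrative assumptions, not the authors' code or their exact formulation.

```python
# Minimal sketch of the rank intuition behind the vocabulary bottleneck:
# if the embedding matrix has rank r < d_model, then every first-layer input
# lies in an r-dimensional subspace, so the attention logits Q K^T also have
# rank at most r, no matter how wide the model is.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, seq_len = 30000, 256, 128
embedding_rank = 32  # assumed low-rank (factorized) embedding, r < d_model

# Factorized token embedding of rank <= embedding_rank (ALBERT-style).
E = rng.standard_normal((vocab_size, embedding_rank)) @ \
    rng.standard_normal((embedding_rank, d_model))

tokens = rng.integers(0, vocab_size, size=seq_len)
X = E[tokens]                                  # (seq_len, d_model) first-layer input

W_q = rng.standard_normal((d_model, d_model))  # query projection
W_k = rng.standard_normal((d_model, d_model))  # key projection

scores = (X @ W_q) @ (X @ W_k).T               # (seq_len, seq_len) attention logits

print("rank of layer input X:   ", np.linalg.matrix_rank(X))       # <= 32
print("rank of attention logits:", np.linalg.matrix_rank(scores))  # <= 32
```

Both printed ranks are bounded by the embedding rank (32 here) rather than by the width d_model, which is the sense in which a small vocabulary or embedding rank bottlenecks self-attention.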
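For the Experiment Setup row, the quoted objective is the standard autoregressive log-likelihood of a decoder-only language model. The PyTorch sketch below shows that loss computation under assumed shapes; the model, optimizer, data pipeline, and the 1M-step loop are omitted, and nothing here is the authors' code.

```python
# Hedged sketch of the autoregressive log-likelihood objective: next-token
# cross-entropy, where position t predicts token t+1 under a causal decoder.
import torch
import torch.nn.functional as F

def autoregressive_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of predicting token t+1 from positions <= t.

    logits: (batch, seq_len, vocab_size) outputs of a causal decoder
    tokens: (batch, seq_len) input token ids
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predictions for positions 0..T-2
    target = tokens[:, 1:].reshape(-1)                     # targets are the next tokens
    return F.cross_entropy(pred, target)

# Illustrative shapes only (not values from the paper).
batch, seq_len, vocab_size = 8, 128, 30000
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))
print(autoregressive_nll(logits, tokens).item())  # this loss is minimized for 1M steps
```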