Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

Authors: Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the often glossed-over usage of different vocabulary sizes or embedding ranks in different domains. As an additional benefit, our rank bottlenecking framework allows us to identify size redundancies of 25%-50% in leading NLP models such as ALBERT and T5. (A rank-bottleneck sketch follows the table.) |
| Researcher Affiliation | Academia | The Hebrew University of Jerusalem. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Our training set was English Wikipedia, Book Corpus and Open Web Text, with a total size of 60G. |
| Dataset Splits | No | The paper mentions a training set and a test set, but does not provide specific split percentages or sample counts for training, validation, and test splits from a single dataset, nor does it explicitly mention a validation set. |
| Hardware Specification | No | The paper states "Experiments were performed with Cloud TPUs" but does not specify the TPU version or type, nor any other hardware details. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We trained decoder-only language models, by optimizing the autoregressive log-likelihood of the training examples for 1M steps. The remainder of the training details are given in the appendix. (A hedged sketch of this objective also follows the table.) |
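The Research Type row refers to the paper's rank-bottleneck argument: when the token embedding has rank smaller than the model width, the representations feeding self-attention are confined to a low-dimensional subspace. The NumPy sketch below illustrates only this linear-algebra intuition; the vocabulary size, embedding rank, and projection matrices are illustrative assumptions, not the authors' code or their exact formulation.

```python
# Minimal sketch of the rank intuition behind the vocabulary bottleneck:
# if the embedding matrix has rank r < d_model, then every first-layer input
# lies in an r-dimensional subspace, so the attention logits Q K^T also have
# rank at most r, no matter how wide the model is.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, seq_len = 30000, 256, 128
embedding_rank = 32  # assumed low-rank (factorized) embedding, r < d_model

# Factorized token embedding of rank <= embedding_rank (ALBERT-style).
E = rng.standard_normal((vocab_size, embedding_rank)) @ \
    rng.standard_normal((embedding_rank, d_model))

tokens = rng.integers(0, vocab_size, size=seq_len)
X = E[tokens]                                  # (seq_len, d_model) first-layer input

W_q = rng.standard_normal((d_model, d_model))  # query projection
W_k = rng.standard_normal((d_model, d_model))  # key projection

scores = (X @ W_q) @ (X @ W_k).T               # (seq_len, seq_len) attention logits

print("rank of layer input X:   ", np.linalg.matrix_rank(X))       # <= 32
print("rank of attention logits:", np.linalg.matrix_rank(scores))  # <= 32
```

Both printed ranks are bounded by the embedding rank (32 here) rather than by the width d_model, which is the sense in which a small vocabulary or embedding rank bottlenecks self-attention.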
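For the Experiment Setup row, the quoted objective is the standard autoregressive log-likelihood of a decoder-only language model. The PyTorch sketch below shows that loss computation under assumed shapes; the model, optimizer, data pipeline, and the 1M-step loop are omitted, and nothing here is the authors' code.

```python
# Hedged sketch of the autoregressive log-likelihood objective: next-token
# cross-entropy, where position t predicts token t+1 under a causal decoder.
import torch
import torch.nn.functional as F

def autoregressive_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of predicting token t+1 from positions <= t.

    logits: (batch, seq_len, vocab_size) outputs of a causal decoder
    tokens: (batch, seq_len) input token ids
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predictions for positions 0..T-2
    target = tokens[:, 1:].reshape(-1)                     # targets are the next tokens
    return F.cross_entropy(pred, target)

# Illustrative shapes only (not values from the paper).
batch, seq_len, vocab_size = 8, 128, 30000
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))
print(autoregressive_nll(logits, tokens).item())  # this loss is minimized for 1M steps
```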