Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
Authors: Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the often glossed-over usage of different vocabulary sizes or embedding ranks in different domains. As an additional benefit, our rank bottlenecking framework allows us to identify size redundancies of 25%-50% in leading NLP models such as ALBERT and T5. |
| Researcher Affiliation | Academia | The Hebrew University of Jerusalem. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Our training set was English Wikipedia, Book Corpus and Open Web Text, with a total size of 60G. |
| Dataset Splits | No | The paper mentions a training set and a test set, but does not provide specific split percentages or sample counts for training, validation, and test splits from a single dataset, nor does it explicitly mention a validation set. |
| Hardware Specification | No | The paper states "Experiments were performed with Cloud TPUs" but does not specify the version or type of TPUs or any other hardware details. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We trained decoder-only language models, by optimizing the autoregressive log-likelihood of the training examples for 1M steps. The remainder of the training details are given in the appendix. |
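
The Experiment Setup row quotes the training objective as the autoregressive log-likelihood. As a rough illustration of what that quantity is, the sketch below sums log p(token_t | tokens before t) over a sequence using a numerically stable log-softmax. It is not the authors' code: the function name `autoregressive_log_likelihood`, the plain-list input format, and the toy logits are all assumptions made for this example, and a real training run would compute the same quantity in batched form inside a deep learning framework.

```python
import math


def autoregressive_log_likelihood(logits_per_step, token_ids):
    """Return the sum over positions t of log p(token_t | tokens before t).

    Hypothetical helper for illustration only.
    logits_per_step[t] is a list of unnormalized scores over the vocabulary
    for position t, assumed to be already conditioned on the earlier tokens.
    token_ids[t] is the index of the token actually observed at position t.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, token_ids):
        # Stable log-sum-exp: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # log-softmax of the observed token.
        total += logits[target] - log_z
    return total


# Toy data (assumed): vocabulary of 3 tokens, sequence of 2 tokens.
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform_ll = autoregressive_log_likelihood(uniform_logits, [1, 2])

# A model that strongly prefers the observed token gets a value near 0.
peaked_logits = [[10.0, 0.0, 0.0]]
peaked_ll = autoregressive_log_likelihood(peaked_logits, [0])
```

With uniform logits every token has probability 1/3, so the two-token toy sequence scores 2·log(1/3). Training maximizes this sum (equivalently, minimizes its negative) over the training examples.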