Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Alternating Updates for Efficient Transformers
Authors: Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. |
| Researcher Affiliation | Collaboration | Cenk Baykal (Google Research); Dylan Cutler (Google Research); Nishanth Dikkala (Google Research); Nikhil Ghosh (UC Berkeley); Rina Panigrahy (Google Research); Xin Wang (Google Research) |
| Pseudocode | Yes | Algorithm 1: Alternating Updates (AltUp) Layer |
| Open Source Code | No | The paper does not explicitly state that its source code is open-source or provide a link to a code repository. |
| Open Datasets | Yes | We performed all of our experiments using T5-model architectures [37] of varying sizes (small, base, large, and 3B) which we pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned on either the GLUE [50], SuperGLUE (SG) [49], SQuAD [39] or Trivia-QA (closed-book) [23, 42] benchmark tasks |
| Dataset Splits | Yes | We report both pretraining and finetuning metrics: for pretraining, we report span prediction accuracy on a hold-out validation set, and for finetuning, we follow the same recipe as the T5 models, see [37] for more details. |
| Hardware Specification | Yes | Latency is measured on TPUv3 with 8 cores. |
| Software Dependencies | No | The paper mentions software components like T5X and Adafactor optimizer, but does not provide specific version numbers for any software or libraries. |
| Experiment Setup | Yes | We performed all of our experiments using T5-model architectures...pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned...for a further 50,000 steps with a batch-size of 256. During pretraining, we use 256 batch size, Adafactor optimizer [46] with base learning rate 1.0 and reciprocal square-root decay with 10000 warmup steps, and zero dropout. During finetuning, we use 256 batch size, Adafactor optimizer with constant learning rate of 0.001 and 0.1 dropout. |
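
The learning-rate settings quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration, not the authors' code. It assumes that "reciprocal square-root decay with 10000 warmup steps" follows the common T5-style form `base_lr / sqrt(max(step, warmup_steps))`, which holds the rate constant during warmup and then decays it. The function names `pretrain_lr` and `finetune_lr` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
PRETRAIN_BASE_LR = 1.0
PRETRAIN_WARMUP_STEPS = 10_000
FINETUNE_LR = 0.001


def pretrain_lr(step: int,
                base_lr: float = PRETRAIN_BASE_LR,
                warmup_steps: int = PRETRAIN_WARMUP_STEPS) -> float:
    """Reciprocal square-root decay (assumed T5-style form).

    The rate stays at base_lr / sqrt(warmup_steps) while
    step <= warmup_steps, then decays as base_lr / sqrt(step).
    """
    return base_lr / math.sqrt(max(step, warmup_steps))


def finetune_lr(step: int, lr: float = FINETUNE_LR) -> float:
    """Constant learning rate used during finetuning."""
    return lr


if __name__ == "__main__":
    # Print the schedule at a few points of the 500,000-step pretraining run.
    for s in (0, 10_000, 40_000, 500_000):
        print(f"step {s:>7}: pretrain lr = {pretrain_lr(s):.6f}")
```

Under this assumed form, the pretraining rate is flat for the first 10,000 steps and then falls with the inverse square root of the step count. The finetuning rate stays fixed for all 50,000 steps.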