Alternating Updates for Efficient Transformers

Authors: Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios.
Researcher Affiliation | Collaboration | Cenk Baykal (Google Research), Dylan Cutler (Google Research), Nishanth Dikkala (Google Research), Nikhil Ghosh (UC Berkeley), Rina Panigrahy (Google Research), Xin Wang (Google Research)
Pseudocode | Yes | Algorithm 1: Alternating Updates (AltUp) Layer (a minimal sketch of this layer appears after the table)
Open Source Code | No | The paper does not explicitly state that its source code is open-source or provide a link to a code repository.
Open Datasets | Yes | We performed all of our experiments using T5-model architectures [37] of varying sizes (small, base, large, and 3B) which we pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned on either the GLUE [50], SuperGLUE (SG) [49], SQuAD [39] or Trivia-QA (closed-book) [23, 42] benchmark tasks
Dataset Splits | Yes | We report both pretraining and finetuning metrics: for pretraining, we report span prediction accuracy on a hold-out validation set, and for finetuning, we follow the same recipe as the T5 models, see [37] for more details.
Hardware Specification | Yes | Latency is measured on TPUv3 with 8 cores.
Software Dependencies | No | The paper mentions software components such as T5X and the Adafactor optimizer, but does not provide specific version numbers for any software or libraries.
Experiment Setup | Yes | We performed all of our experiments using T5-model architectures...pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned...for a further 50,000 steps with a batch-size of 256. During pretraining, we use 256 batch size, Adafactor optimizer [46] with base learning rate 1.0 and reciprocal square-root decay with 10000 warmup steps, and zero dropout. During finetuning, we use 256 batch size, Adafactor optimizer with constant learning rate of 0.001 and 0.1 dropout. (An illustrative optimizer sketch follows the table.)
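
For readers who want a concrete picture of the pseudocode referenced in the Pseudocode row, the following is a minimal JAX sketch of the predict-compute-correct structure that Algorithm 1 (the AltUp layer) describes. It is an illustration under simplifying assumptions, not the authors' implementation: the names `altup_layer`, `layer_fn`, `p`, `g`, and `k` are invented here, and details such as how the activated sub-block index alternates across layers and how the mixing coefficients are parameterized are omitted.

```python
# Minimal sketch of an Alternating Updates (AltUp) layer, assuming a widened
# representation split into K sub-blocks. All names here are illustrative.
import jax.numpy as jnp


def altup_layer(blocks, layer_fn, p, g, k):
    """One AltUp step over K sub-blocks of the widened representation.

    blocks:   [K, batch, seq_len, d_block] array of sub-block activations.
    layer_fn: the ordinary transformer layer, applied to one sub-block only.
    p:        [K, K] learned prediction (mixing) coefficients.
    g:        [K] learned correction gains.
    k:        index of the sub-block activated at this layer; it alternates
              from layer to layer, hence "alternating updates".
    """
    K = blocks.shape[0]

    # 1) Predict: estimate every sub-block's new value by cheap linear mixing.
    predicted = jnp.einsum("ij,j...->i...", p, blocks)

    # 2) Compute: run the expensive transformer layer on one sub-block only.
    computed = layer_fn(predicted[k])

    # 3) Correct: move every prediction toward the computed residual.
    residual = computed - predicted[k]
    gains = g.reshape((K,) + (1,) * (blocks.ndim - 1))
    return predicted + gains * residual
```

With K = 2, for example, the per-token representation is twice as wide while the per-layer compute stays close to that of the narrower model, which is the trade-off the paper targets.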
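
The pretraining recipe quoted in the Experiment Setup row (Adafactor, base learning rate 1.0, reciprocal square-root decay with 10,000 warmup steps; constant 0.001 for finetuning) can be sketched with optax as below. This is a hedged reconstruction, not the authors' T5X configuration: the schedule form is the standard T5 inverse-square-root rule that the quoted hyperparameters appear to follow, and `rsqrt_schedule` is a name invented here.

```python
# Illustrative optax reconstruction of the quoted optimizer settings; it only
# mirrors the stated hyperparameters and is not the paper's T5X config.
import jax.numpy as jnp
import optax


def rsqrt_schedule(base_lr: float = 1.0, warmup_steps: int = 10_000):
    """T5-style schedule: lr(step) = base_lr / sqrt(max(step, warmup_steps))."""
    def schedule(step):
        return base_lr / jnp.sqrt(jnp.maximum(step, warmup_steps).astype(jnp.float32))
    return schedule


pretrain_tx = optax.adafactor(learning_rate=rsqrt_schedule())  # pretraining
finetune_tx = optax.adafactor(learning_rate=1e-3)              # finetuning
```

Batch size (256) and dropout (0 for pretraining, 0.1 for finetuning) are data-pipeline and model settings, so they are not captured by the optimizer objects above.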