Alternating Updates for Efficient Transformers

Authors: Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios.
Researcher Affiliation | Collaboration | Cenk Baykal (Google Research), Dylan Cutler (Google Research), Nishanth Dikkala (Google Research), Nikhil Ghosh (UC Berkeley), Rina Panigrahy (Google Research), Xin Wang (Google Research)
Pseudocode | Yes | Algorithm 1: Alternating Updates (AltUp) Layer (a minimal sketch of this layer appears after the table)
Open Source Code | No | The paper does not explicitly state that its source code is open-source or provide a link to a code repository.
Open Datasets | Yes | We performed all of our experiments using T5-model architectures [37] of varying sizes (small, base, large, and 3B) which we pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned on either the GLUE [50], SuperGLUE (SG) [49], SQuAD [39] or Trivia-QA (closed-book) [23, 42] benchmark tasks
Dataset Splits | Yes | We report both pretraining and finetuning metrics: for pretraining, we report span prediction accuracy on a hold-out validation set, and for finetuning, we follow the same recipe as the T5 models, see [37] for more details.
Hardware Specification | Yes | Latency is measured on TPUv3 with 8 cores.
Software Dependencies | No | The paper mentions software components such as T5X and the Adafactor optimizer, but does not provide specific version numbers for any software or libraries.
Experiment Setup | Yes | We performed all of our experiments using T5-model architectures...pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned...for a further 50,000 steps with a batch-size of 256. During pretraining, we use 256 batch size, Adafactor optimizer [46] with base learning rate 1.0 and reciprocal square-root decay with 10000 warmup steps, and zero dropout. During finetuning, we use 256 batch size, Adafactor optimizer with constant learning rate of 0.001 and 0.1 dropout. (An illustrative optimizer sketch follows the table.)
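
For readers who want a concrete picture of the pseudocode referenced in the Pseudocode row, the following is a minimal JAX sketch of the predict-compute-correct structure that Algorithm 1 (the AltUp layer) describes. It is an illustration under simplifying assumptions, not the authors' implementation: the names `altup_layer`, `layer_fn`, `p`, `g`, and `k` are invented here, and details such as how the activated sub-block index alternates across layers and how the mixing coefficients are parameterized are omitted.

```python
# Minimal sketch of an Alternating Updates (AltUp) layer, assuming a widened
# representation split into K sub-blocks. All names here are illustrative.
import jax.numpy as jnp


def altup_layer(blocks, layer_fn, p, g, k):
    """One AltUp step over K sub-blocks of the widened representation.

    blocks:   [K, batch, seq_len, d_block] array of sub-block activations.
    layer_fn: the ordinary transformer layer, applied to one sub-block only.
    p:        [K, K] learned prediction (mixing) coefficients.
    g:        [K] learned correction gains.
    k:        index of the sub-block activated at this layer; it alternates
              from layer to layer, hence "alternating updates".
    """
    K = blocks.shape[0]

    # 1) Predict: estimate every sub-block's new value by cheap linear mixing.
    predicted = jnp.einsum("ij,j...->i...", p, blocks)

    # 2) Compute: run the expensive transformer layer on one sub-block only.
    computed = layer_fn(predicted[k])

    # 3) Correct: move every prediction toward the computed residual.
    residual = computed - predicted[k]
    gains = g.reshape((K,) + (1,) * (blocks.ndim - 1))
    return predicted + gains * residual
```

With K = 2, for example, the per-token representation is twice as wide while the per-layer compute stays close to that of the narrower model, which is the trade-off the paper targets.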
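
The pretraining recipe quoted in the Experiment Setup row (Adafactor, base learning rate 1.0, reciprocal square-root decay with 10,000 warmup steps; constant 0.001 for finetuning) can be sketched with optax as below. This is a hedged reconstruction, not the authors' T5X configuration: the schedule form is the standard T5 inverse-square-root rule that the quoted hyperparameters appear to follow, and `rsqrt_schedule` is a name invented here.

```python
# Illustrative optax reconstruction of the quoted optimizer settings; it only
# mirrors the stated hyperparameters and is not the paper's T5X config.
import jax.numpy as jnp
import optax


def rsqrt_schedule(base_lr: float = 1.0, warmup_steps: int = 10_000):
    """T5-style schedule: lr(step) = base_lr / sqrt(max(step, warmup_steps))."""
    def schedule(step):
        return base_lr / jnp.sqrt(jnp.maximum(step, warmup_steps).astype(jnp.float32))
    return schedule


pretrain_tx = optax.adafactor(learning_rate=rsqrt_schedule())  # pretraining
finetune_tx = optax.adafactor(learning_rate=1e-3)              # finetuning
```

Batch size (256) and dropout (0 for pretraining, 0.1 for finetuning) are data-pipeline and model settings, so they are not captured by the optimizer objects above.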