Primer: Searching for Efficient Transformers for Language Modeling
Authors: David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We conduct our comparisons across three different codebases: Tensor2Tensor (T2T) [25], T5 [5], and Lingvo [50]. In the following sections, we will present our results in four main experiments on auto-regressive language modeling. |
| Researcher Affiliation | Industry | David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le. Google Research, Brain Team. {davidso, wojciechm, hanxiaol, zihangd, noam, qvl}@google.com |
| Pseudocode | Yes | Figure 4: The two main modifications that give Primer most of its gains: depthwise convolution added to attention multi-head projections and squared ReLU activations. These modifications are easy to implement and transfer well across codebases. We call the model with just these two modifications Primer-EZ. Blue indicates portions of the original Transformer and red signifies one of our proposed modifications. MDHA Projection Pseudo-code: `# Use to create each K, Q, and V head of size hs.` `def mdha_projection(x, hs):` `# Create head.` `x = proj(x, head_size=hs, axis="channel")` `# Apply DConv to head.` `x = d_conv(x, width=3, head_size=hs, axis="spatial", mask="causal")` `return x` (a runnable sketch of this projection appears below the table). |
| Open Source Code | Yes | We open source our models and several comparisons in T5 to help with reproducibility. [1] https://github.com/google-research/google-research/tree/master/primer |
| Open Datasets | Yes | We give each model a fixed training budget (24 TPUv2 hours) and define its fitness as its perplexity on the One Billion Words Benchmark (LM1B) [24] in Tensor2Tensor [25]. We also continue training each model to 1M steps to study the effect of larger compute budgets on Primer savings. The results, shown in Figure 9, indicate that the Primer models are as strong in larger data, higher compute regimes, as they are in the smaller LM1B regime. Compared to the vanilla baseline, Primer and Primer-EZ are at least 1.8X more efficient at the end of training on both PG19 and C4. |
| Dataset Splits | No | The paper refers to 'validation loss' and 'pretraining perplexity' but does not explicitly provide details about specific training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit instructions for creating these splits). |
| Hardware Specification | Yes | Our experiments show that Primer has the benefits of (1) achieving a target quality using a smaller training cost, (2) achieving higher quality given a fixed training cost, and (3) achieving a target quality using a smaller inference cost. These benefits are robust and hold across model sizes (20M to 1.9B parameters), across compute scales (10 to 10^5 accelerator hours), across datasets (LM1B, C4, PG19 [22]), across hardware platforms (TPUv2, TPUv3, TPUv4 and V100), across multiple Transformer codebases using default configurations (Tensor2Tensor, Lingvo, and T5) and across multiple model families (dense Transformers [1], sparse mixture-of-experts Switch Transformers [8], and Synthesizers [23]). Each model is trained using batches of 2M tokens using 512 TPUv4 chips for 140 hours (~71.8K total accelerator hours or 1M train steps). |
| Software Dependencies | No | The paper mentions using TensorFlow and the Tensor2Tensor (T2T) [25], T5 [5], and Lingvo [50] codebases but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | First, we analyze Primer's performance on the search task: LM1B language modeling with sequence length 64, 35M model parameters, batches of 4096 tokens and 24 hours of training. In the larger-scale experiments, the batches are increased to 65K tokens, the sequence lengths are a longer 512, each decoder is 110M parameters (dmodel = 768, dff = 3072, L = 12) and each model is trained to 525K steps on 4 TPUv3 chips. A further configuration is the same as the C4 configuration in the previous section, but uses batches of 1M tokens, 64 TPUv3 chips and 537M parameters (dmodel = 1024, dff = 8192, L = 24). These configurations are summarized in the sketch below the table. |
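To make the quoted MDHA pseudocode concrete, below is a minimal, self-contained NumPy sketch of the two Primer-EZ modifications: the squared ReLU activation and a per-head dense projection followed by a width-3 causal depthwise convolution over the sequence axis. This is an illustration of the technique as described in the paper, not the authors' implementation; the function and variable names (`squared_relu`, `causal_depthwise_conv`, `w_proj`, `conv_kernel`) are our own.

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: the Primer feed-forward activation, max(x, 0) ** 2."""
    return np.square(np.maximum(x, 0.0))

def causal_depthwise_conv(x, kernel):
    """Causal depthwise convolution over the sequence (spatial) axis.

    x:      [seq_len, head_size] -- one attention head for one example.
    kernel: [width, head_size]   -- per-channel filter weights.
    """
    width, head_size = kernel.shape
    # Left-pad so position t only sees positions <= t (causal masking).
    padded = np.concatenate([np.zeros((width - 1, head_size)), x], axis=0)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Per-channel weighted sum of the current and previous (width - 1) positions.
        out[t] = np.sum(padded[t:t + width] * kernel, axis=0)
    return out

def mdha_projection(x, w_proj, conv_kernel):
    """Multi-DConv-Head Attention projection for a single K, Q, or V head:
    a dense projection followed by a width-3 causal depthwise convolution."""
    head = x @ w_proj  # [seq_len, head_size]
    return causal_depthwise_conv(head, conv_kernel)

# Example usage with toy shapes.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 512))           # [seq_len, d_model]
w_proj = rng.normal(size=(512, 64))      # [d_model, head_size]
conv_kernel = rng.normal(size=(3, 64))   # [width, head_size]
head = mdha_projection(x, w_proj, conv_kernel)  # [64, 64]
```

A production implementation would vectorize the convolution across heads and batch elements; the explicit loop here just keeps the causal masking easy to follow.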
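For quick reference, the three training configurations quoted in the Experiment Setup row can be collected into a plain Python dictionary. The values are copied from the quoted text; the key names and setup labels are illustrative rather than taken from any of the paper's codebases.

```python
# Configurations quoted in the "Experiment Setup" row above.
# Key names are illustrative; values come from the paper's setup text.
EXPERIMENT_SETUPS = {
    "lm1b_search_task": {
        "sequence_length": 64,
        "model_parameters": 35_000_000,
        "batch_size_tokens": 4_096,
        "training_budget_hours": 24,
    },
    "c4_110m": {
        "sequence_length": 512,
        "model_parameters": 110_000_000,
        "batch_size_tokens": 65_000,
        "d_model": 768,
        "d_ff": 3072,
        "num_layers": 12,
        "train_steps": 525_000,
        "hardware": "4x TPUv3",
    },
    "c4_537m": {
        "model_parameters": 537_000_000,
        "batch_size_tokens": 1_000_000,
        "d_model": 1024,
        "d_ff": 8192,
        "num_layers": 24,
        "hardware": "64x TPUv3",
    },
}
```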