Primer: Searching for Efficient Transformers for Language Modeling
Authors: David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We conduct our comparisons across three different codebases: Tensor2Tensor (T2T) [25], T5 [5], and Lingvo [50]. In the following sections, we will present our results in four main experiments on auto-regressive language modeling. |
| Researcher Affiliation | Industry | David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le. Google Research, Brain Team. {davidso, wojciechm, hanxiaol, zihangd, noam, qvl}@google.com |
| Pseudocode | Yes | Figure 4: The two main modifications that give Primer most of its gains: depthwise convolution added to attention multi-head projections and squared ReLU activations. These modifications are easy to implement and transfer well across codebases. We call the model with just these two modifications Primer-EZ. Blue indicates portions of the original Transformer and red signifies one of our proposed modifications. MDHA Projection Pseudo-code: `# Use to create each K, Q, and V head of size hs.` `def mdha_projection(x, hs):` `# Create head.` `x = proj(x, head_size=hs, axis="channel")` `# Apply DConv to head.` `x = d_conv(x, width=3, head_size=hs, axis="spatial", mask="causal")` `return x` (a runnable sketch of this projection appears below the table). |
| Open Source Code | Yes | We open source our models and several comparisons in T5 to help with reproducibility. [1] https://github.com/google-research/google-research/tree/master/primer |
| Open Datasets | Yes | We give each model a fixed training budget (24 TPUv2 hours) and define its fitness as its perplexity on the One Billion Words Benchmark (LM1B) [24] in Tensor2Tensor [25]. We also continue training each model to 1M steps to study the effect of larger compute budgets on Primer savings. The results, shown in Figure 9, indicate that the Primer models are as strong in larger data, higher compute regimes, as they are in the smaller LM1B regime. Compared to the vanilla baseline, Primer and Primer-EZ are at least 1.8X more efficient at the end of training on both PG19 and C4. |
| Dataset Splits | No | The paper refers to 'validation loss' and 'pretraining perplexity' but does not explicitly provide details about specific training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit instructions for creating these splits). |
| Hardware Specification | Yes | Our experiments show that Primer has the benefits of (1) achieving a target quality using a smaller training cost, (2) achieving higher quality given a fixed training cost, and (3) achieving a target quality using a smaller inference cost. These benefits are robust and hold across model sizes (20M to 1.9B parameters), across compute scales (10 to 10^5 accelerator hours), across datasets (LM1B, C4, PG19 [22]), across hardware platforms (TPUv2, TPUv3, TPUv4 and V100), across multiple Transformer codebases using default configurations (Tensor2Tensor, Lingvo, and T5) and across multiple model families (dense Transformers [1], sparse mixture-of-experts Switch Transformers [8], and Synthesizers [23]). Each model is trained using batches of 2M tokens using 512 TPUv4 chips for 140 hours (~71.8K total accelerator hours or 1M train steps). |
| Software Dependencies | No | The paper mentions using TensorFlow and the Tensor2Tensor (T2T) [25], T5 [5], and Lingvo [50] codebases but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | First, we analyze Primer's performance on the search task: LM1B language modeling with sequence length 64, 35M model parameters, batches of 4096 tokens and 24 hours of training. In the larger-scale experiments, the batches are increased to 65K tokens, the sequence lengths are a longer 512, each decoder is 110M parameters (dmodel = 768, dff = 3072, L = 12) and each model is trained to 525K steps on 4 TPUv3 chips. A further configuration is the same as the C4 configuration in the previous section, but uses batches of 1M tokens, 64 TPUv3 chips and 537M parameters (dmodel = 1024, dff = 8192, L = 24). These configurations are summarized in the sketch below the table. |
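To make the quoted MDHA pseudocode concrete, below is a minimal, self-contained NumPy sketch of the two Primer-EZ modifications: the squared ReLU activation and a per-head dense projection followed by a width-3 causal depthwise convolution over the sequence axis. This is an illustration of the technique as described in the paper, not the authors' implementation; the function and variable names (`squared_relu`, `causal_depthwise_conv`, `w_proj`, `conv_kernel`) are our own.

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: the Primer feed-forward activation, max(x, 0) ** 2."""
    return np.square(np.maximum(x, 0.0))

def causal_depthwise_conv(x, kernel):
    """Causal depthwise convolution over the sequence (spatial) axis.

    x:      [seq_len, head_size] -- one attention head for one example.
    kernel: [width, head_size]   -- per-channel filter weights.
    """
    width, head_size = kernel.shape
    # Left-pad so position t only sees positions <= t (causal masking).
    padded = np.concatenate([np.zeros((width - 1, head_size)), x], axis=0)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Per-channel weighted sum of the current and previous (width - 1) positions.
        out[t] = np.sum(padded[t:t + width] * kernel, axis=0)
    return out

def mdha_projection(x, w_proj, conv_kernel):
    """Multi-DConv-Head Attention projection for a single K, Q, or V head:
    a dense projection followed by a width-3 causal depthwise convolution."""
    head = x @ w_proj  # [seq_len, head_size]
    return causal_depthwise_conv(head, conv_kernel)

# Example usage with toy shapes.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 512))           # [seq_len, d_model]
w_proj = rng.normal(size=(512, 64))      # [d_model, head_size]
conv_kernel = rng.normal(size=(3, 64))   # [width, head_size]
head = mdha_projection(x, w_proj, conv_kernel)  # [64, 64]
```

A production implementation would vectorize the convolution across heads and batch elements; the explicit loop here just keeps the causal masking easy to follow.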
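For quick reference, the three training configurations quoted in the Experiment Setup row can be collected into a plain Python dictionary. The values are copied from the quoted text; the key names and setup labels are illustrative rather than taken from any of the paper's codebases.

```python
# Configurations quoted in the "Experiment Setup" row above.
# Key names are illustrative; values come from the paper's setup text.
EXPERIMENT_SETUPS = {
    "lm1b_search_task": {
        "sequence_length": 64,
        "model_parameters": 35_000_000,
        "batch_size_tokens": 4_096,
        "training_budget_hours": 24,
    },
    "c4_110m": {
        "sequence_length": 512,
        "model_parameters": 110_000_000,
        "batch_size_tokens": 65_000,
        "d_model": 768,
        "d_ff": 3072,
        "num_layers": 12,
        "train_steps": 525_000,
        "hardware": "4x TPUv3",
    },
    "c4_537m": {
        "model_parameters": 537_000_000,
        "batch_size_tokens": 1_000_000,
        "d_model": 1024,
        "d_ff": 8192,
        "num_layers": 24,
        "hardware": "64x TPUv3",
    },
}
```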