Brainformers: Trading Simplicity for Efficiency
Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, Zhifeng Chen, Quoc V Le, Claire Cui, James Laudon, Jeff Dean
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Additionally, Section 5 is explicitly titled "Evaluation" and contains subsections for "Training Convergence", "Finetuning Results", and "Fewshot Results", all presenting empirical data and comparisons. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yanqi Zhou <yanqiz@google.com>. |
| Pseudocode | Yes | Algorithm 1 Brainformer Block Search |
| Open Source Code | No | The paper does not contain any statements about making its source code publicly available, nor does it provide a link to a code repository for the Brainformer implementation. |
| Open Datasets | Yes | We use the high-quality dataset from GLaM of 1.6 trillion tokens that are representative of a wide range of natural language use cases. This dataset consists of a high-quality filtered subset of webpages that are combined with smaller corpora of books, Wikipedia pages, conversations, forums, and news to create the final dataset. A more detailed description of the dataset, including the data and mixture weights, can be found in the GLaM paper (Du et al., 2022). |
| Dataset Splits | Yes | We mainly focus on two types of downstream evaluation: 1) Fine-tuning performance on 11 selected classification tasks from the GLUE and SuperGLUE benchmarks (Wang et al., 2018; 2019). 2) We evaluate one-shot performance with five language generation tasks focused on question answering. |
| Hardware Specification | Yes | We train and evaluate our Brainformer models and baseline models on 64 Cloud TPU-V4 chips, except for models at the 8B-scale which take 512 Cloud TPU-V4 chips to train. |
| Software Dependencies | No | The paper mentions using an 'Adafactor optimizer (Shazeer & Stern, 2018)' and a 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies, such as deep learning frameworks (e.g., TensorFlow, PyTorch). |
| Experiment Setup | Yes | Our model training follows the setup of GLaM, where a maximum sequence length of 1024 tokens is used. We use an Adafactor optimizer (Shazeer & Stern, 2018) with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. The learning rate is kept constant for the first 10K training steps, then decayed with an inverse square root schedule (a minimal sketch of this schedule appears after the table). We use the SentencePiece subword tokenizer with a vocabulary size of 256K. |
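
The learning-rate schedule quoted in the Experiment Setup row (constant for the first 10K steps, then inverse-square-root decay) can be written out compactly. The sketch below is an illustrative Python rendering of that description only; the function name and the base learning rate are assumptions, since the quoted setup does not state the peak value.

```python
# Illustrative sketch of the schedule described above: constant learning
# rate for the first 10K steps, then inverse-square-root decay.
# `base_lr` is an assumed placeholder value, not taken from the paper.

def learning_rate(step: int, base_lr: float = 0.01,
                  constant_steps: int = 10_000) -> float:
    """Return the learning rate at a given training step."""
    if step <= constant_steps:
        return base_lr                               # flat initial phase
    return base_lr * (constant_steps / step) ** 0.5  # inverse sqrt decay

# Example: at step 40K the rate has halved relative to the constant phase.
assert learning_rate(40_000) == 0.5 * learning_rate(10_000)
```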