Brainformers: Trading Simplicity for Efficiency

Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, Zhifeng Chen, Quoc V Le, Claire Cui, James Laudon, Jeff Dean

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Additionally, Section 5 is explicitly titled "Evaluation" and contains subsections for "Training Convergence", "Finetuning Results", and "Fewshot Results", all presenting empirical data and comparisons.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yanqi Zhou <yanqiz@google.com>.
Pseudocode | Yes | Algorithm 1 Brainformer Block Search (an illustrative sketch of a block-search loop appears after this table)
Open Source Code | No | The paper does not contain any statements about making its source code publicly available, nor does it provide a link to a code repository for the Brainformer implementation.
Open Datasets | Yes | We use the high-quality dataset from GLaM of 1.6 trillion tokens that are representative of a wide range of natural language use cases. This dataset consists of a high-quality filtered subset of webpages combined with smaller corpora of books, Wikipedia pages, conversations, forums, and news to create the final dataset. A more detailed description of the dataset, including the data sources and mixture weights, can be found in the GLaM paper (Du et al., 2022).
Dataset Splits | Yes | We mainly focus on two types of downstream evaluation: 1) Fine-tuning performance on 11 selected classification tasks from the GLUE and SuperGLUE benchmarks (Wang et al., 2018; 2019). 2) One-shot performance on five language generation tasks focused on question answering.
Hardware Specification | Yes | We train and evaluate our Brainformer models and baseline models on 64 Cloud TPU-V4 chips, except for models at the 8B scale, which take 512 Cloud TPU-V4 chips to train.
Software Dependencies | No | The paper mentions an Adafactor optimizer (Shazeer & Stern, 2018) and the SentencePiece subword tokenizer, but does not provide version numbers for these or for any other software dependencies such as deep learning frameworks (e.g., TensorFlow, PyTorch).
Experiment Setup | Yes | Our model training follows the setup of GLaM, where a maximum sequence length of 1024 tokens is used. We use an Adafactor optimizer (Shazeer & Stern, 2018) with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. The learning rate is kept constant for the first 10K training steps and is then decayed with an inverse square root schedule. We use the SentencePiece subword tokenizer with a vocabulary size of 256K. (A sketch of this learning-rate schedule appears after this table.)
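
The "Pseudocode" row names Algorithm 1, a block search over Brainformer layer stackings. The toy loop below is only meant to illustrate what a search over candidate blocks looks like; the layer-type vocabulary (attention, dense FFN, sparse MoE FFN) follows the Brainformer block described in the paper, but the random-search strategy, the placeholder fitness function, and all names here are illustrative assumptions and do not reproduce the paper's Algorithm 1.

```python
import random

# Hypothetical search space: a block is a sequence of sub-layer types.
LAYER_CHOICES = ["attention", "dense_ffn", "moe_ffn"]


def sample_block(num_layers: int = 8) -> list[str]:
    """Sample a candidate block as a sequence of layer types."""
    return [random.choice(LAYER_CHOICES) for _ in range(num_layers)]


def fitness(block: list[str]) -> float:
    """Placeholder fitness. In a real search this would be the quality of a
    proxy model trained with this block, subject to a step-time budget."""
    return random.random()


def block_search(num_candidates: int = 32) -> list[str]:
    """Evaluate candidate blocks and return the best one under `fitness`."""
    candidates = [sample_block() for _ in range(num_candidates)]
    return max(candidates, key=fitness)


if __name__ == "__main__":
    print(block_search())
```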
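
The quoted experiment setup fully specifies the shape of the learning-rate schedule: constant for the first 10K steps, then inverse square root decay. Below is a minimal sketch of such a schedule; the `peak_lr` value of 0.01, the function and argument names, and the choice to make the decay continuous at the 10K-step boundary are assumptions for illustration, not values taken from the paper.

```python
def inverse_sqrt_schedule(step: int,
                          peak_lr: float = 0.01,
                          constant_steps: int = 10_000) -> float:
    """Constant learning rate for `constant_steps`, then 1/sqrt(step) decay.

    peak_lr is an assumed placeholder; the paper does not state its value
    in the quoted setup.
    """
    if step <= constant_steps:
        return peak_lr
    # Scale so the schedule is continuous at step == constant_steps.
    return peak_lr * (constant_steps / step) ** 0.5


if __name__ == "__main__":
    for s in (1, 10_000, 40_000, 160_000):
        print(s, round(inverse_sqrt_schedule(s), 5))
```

In practice this callable would be handed to the optimizer as the per-step learning rate, alongside the Adafactor settings quoted above (β1 = 0, β2 = 0.99).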