Brainformers: Trading Simplicity for Efficiency
Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, Zhifeng Chen, Quoc V Le, Claire Cui, James Laudon, Jeff Dean
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Additionally, Section 5 is explicitly titled "Evaluation" and contains subsections for "Training Convergence", "Finetuning Results", and "Fewshot Results", all presenting empirical data and comparisons. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yanqi Zhou <yanqiz@google.com>. |
| Pseudocode | Yes | Algorithm 1 Brainformer Block Search |
| Open Source Code | No | The paper does not contain any statements about making its source code publicly available, nor does it provide a link to a code repository for the Brainformer implementation. |
| Open Datasets | Yes | We use the high-quality dataset from GLaM of 1.6 trillion tokens that are representative of a wide range of natural language use cases. This dataset consists of a high-quality filtered subset of webpages that are combined with smaller corpora of books, Wikipedia pages, conversations, forums, and news to create the final dataset. A more detailed description of the dataset, including the data and mixture weights, can be found in the GLaM paper (Du et al., 2022). |
| Dataset Splits | Yes | We mainly focus on two types of downstream evaluation: 1) Fine-tuning performance on 11 selected classification tasks from the GLUE and SuperGLUE benchmarks (Wang et al., 2018; 2019). 2) We evaluate one-shot performance with five language generation tasks focused on question answering. |
| Hardware Specification | Yes | We train and evaluate our Brainformer models and baseline models on 64 Cloud TPU-V4 chips, except for models at the 8B-scale which take 512 Cloud TPU-V4 chips to train. |
| Software Dependencies | No | The paper mentions using an 'Adafactor optimizer (Shazeer & Stern, 2018)' and a 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies, such as deep learning frameworks (e.g., TensorFlow, PyTorch). |
| Experiment Setup | Yes | Our model training follows the setup of GLaM, where a maximum sequence length of 1024 tokens is used. We use an Adafactor optimizer (Shazeer & Stern, 2018) with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. The learning rate is kept constant for the first 10K training steps, then decayed with an inverse square root schedule (a minimal sketch of this schedule appears after the table). We use the SentencePiece subword tokenizer with a vocabulary size of 256K. |
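
The learning-rate schedule quoted in the Experiment Setup row (constant for the first 10K steps, then inverse-square-root decay) can be written out compactly. The sketch below is an illustrative Python rendering of that description only; the function name and the base learning rate are assumptions, since the quoted setup does not state the peak value.

```python
# Illustrative sketch of the schedule described above: constant learning
# rate for the first 10K steps, then inverse-square-root decay.
# `base_lr` is an assumed placeholder value, not taken from the paper.

def learning_rate(step: int, base_lr: float = 0.01,
                  constant_steps: int = 10_000) -> float:
    """Return the learning rate at a given training step."""
    if step <= constant_steps:
        return base_lr                               # flat initial phase
    return base_lr * (constant_steps / step) ** 0.5  # inverse sqrt decay

# Example: at step 40K the rate has halved relative to the constant phase.
assert learning_rate(40_000) == 0.5 * learning_rate(10_000)
```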