GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Authors: Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, Claire Cui

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-, one- and few-shot performance across 29 NLP tasks.
Researcher Affiliation | Industry | Google. Correspondence to: Nan Du, Yanping Huang, and Andrew M. Dai <dunan@google.com, huangyp@google.com, adai@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described in the paper. There are no mentions of code repositories or public code releases by the authors.
Open Datasets | Yes | To train our model, we build a high-quality dataset of 1.6 trillion tokens... We also incorporate the data from public domain social media conversations used by Adiwardana et al. (2020). We set the mixture weights based on the performance of each component in a smaller model and to prevent small sources such as Wikipedia from being over-sampled. Table 3 shows the details of our data component sizes and mixture weights. (A sampling sketch follows the table.)
Dataset Splits | Yes | For the zero-shot learning setting, in most cases, we evaluate each example in the development set directly. For one/few-shot learning, we mainly draw random one/few examples from that task's training set as the only demonstration and context... For each evaluation dataset we report the number of examples which overlap with the pretraining data... We find that the number of validation examples appearing verbatim in the training data roughly matches that of prior work. We report these numbers in Table 6. (A prompt-construction sketch follows the table.)
Hardware Specification | Yes | The largest GLaM 64B/64E model was trained on 1,024 Cloud TPU-v4 chips. Using 326W measured system power per TPU-v4 chip, this leads to a total energy consumption of 213 MWh for GLaM. (A back-of-envelope check follows the table.)
Software Dependencies | No | The paper mentions several software components such as the SentencePiece tokenizer, the Adafactor optimizer, the GSPMD compiler pass, and the Perspective API, but does not provide specific version numbers for any of these dependencies.
Experiment Setup | Yes | We use the same learning hyperparameters for all GLaM models. More specifically, we use a maximum sequence length of 1024 tokens, and pack each input example to have up to 1 million tokens per batch. The dropout rate is set to 0... Our optimizer is Adafactor (Shazeer & Stern, 2018) with first-moment decay β1 = 0, second-moment decay β2 = 0.99 with a 1 − t^(−0.8) decay schedule, update clipping threshold of 1.0, and factored second-moment estimation. We keep the initial learning rate of 0.01 for the first 10K training steps, and then decay it with an inverse square root schedule lr ∝ 1/√t. On top of the standard cross-entropy loss, we add the MoE auxiliary loss as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient... (A schedule sketch follows the table.)
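
The mixture-weight scheme quoted in the Open Datasets row can be illustrated with a minimal sketch. The component names and weights below are placeholders, not the values from Table 3 of the paper; only the mechanism (pick a data component in proportion to its mixture weight, then draw a document from it) reflects the quoted description.

```python
import random

# Hypothetical data components and mixture weights (placeholders, NOT the
# actual values from Table 3 of the GLaM paper).
components = {
    "web_pages": 0.45,
    "dialog": 0.30,
    "books": 0.20,
    "wikipedia": 0.05,  # small source kept from being over-sampled
}

def sample_component(weights, rng=random):
    """Pick a data component in proportion to its mixture weight."""
    names = list(weights)
    probs = [weights[n] for n in names]
    return rng.choices(names, weights=probs, k=1)[0]

# Each training example is drawn by first choosing a component, then drawing
# a document from that component's corpus (the corpora are not shown here).
counts = {name: 0 for name in components}
for _ in range(10_000):
    counts[sample_component(components)] += 1
print(counts)  # counts come out roughly proportional to the mixture weights
```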
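
The one/few-shot protocol quoted in the Dataset Splits row (random demonstrations drawn from the task's training set, evaluation on the development set) can be sketched as follows. The prompt template and the `input`/`target` field names are assumptions for illustration; the excerpt does not specify the exact formatting.

```python
import random

def build_few_shot_prompt(train_examples, eval_example, k, rng=random):
    """Draw k random demonstrations from the training set and prepend them
    to the evaluation example, as described for one/few-shot evaluation.

    The template below (input and target separated by newlines, examples
    separated by blank lines) is an assumption, not the paper's format.
    """
    demos = rng.sample(train_examples, k) if k > 0 else []
    parts = [f"{d['input']}\n{d['target']}" for d in demos]
    parts.append(eval_example["input"])  # the model must complete the target
    return "\n\n".join(parts)

# Toy usage with hypothetical examples.
train = [{"input": f"Q{i}?", "target": f"A{i}"} for i in range(100)]
dev_example = {"input": "Q_new?"}
print(build_few_shot_prompt(train, dev_example, k=3))
```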
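
The energy figure in the Hardware Specification row can be sanity-checked from the quoted numbers alone: 1,024 TPU-v4 chips at a measured 326 W each. The implied wall-clock training time below is only a back-of-envelope consequence of those figures and the 213 MWh total; it is not a duration stated in the excerpt, and it ignores overheads such as datacenter PUE.

```python
# Back-of-envelope check of the quoted energy numbers.
chips = 1024
watts_per_chip = 326                              # measured system power per TPU-v4 chip
total_power_mw = chips * watts_per_chip / 1e6     # total power in megawatts
total_energy_mwh = 213                            # quoted total energy for GLaM training

implied_hours = total_energy_mwh / total_power_mw
print(f"Total power: {total_power_mw:.3f} MW")    # ~0.334 MW
print(f"Implied training time: {implied_hours:.0f} h "
      f"(~{implied_hours / 24:.1f} days)")        # ~638 h, roughly 27 days
```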
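
The learning-rate schedule in the Experiment Setup row (constant 0.01 for the first 10K steps, then inverse square root decay, lr ∝ 1/√t) can be written out directly. The sketch below assumes the decayed phase is anchored so the schedule is continuous at step 10K; the exact proportionality constant is not given in the excerpt. The Adafactor moment settings and the 0.01 MoE auxiliary-loss coefficient are only echoed in comments, not implemented.

```python
import math

WARM_STEPS = 10_000   # constant-LR phase length from the quoted setup
BASE_LR = 0.01        # initial learning rate from the quoted setup

def glam_like_lr(step: int) -> float:
    """Constant 0.01 for the first 10K steps, then inverse-sqrt decay.

    The lr proportional to 1/sqrt(t) phase is anchored so the schedule is
    continuous at step 10K; this anchoring is an assumption, not something
    stated in the paper excerpt.
    """
    if step <= WARM_STEPS:
        return BASE_LR
    return BASE_LR * math.sqrt(WARM_STEPS / step)

# Other quoted settings (for reference only, not implemented here):
#   Adafactor with beta1 = 0, second-moment decay 1 - t^(-0.8),
#   update clipping threshold 1.0, factored second-moment estimation,
#   plus a GShard-style MoE auxiliary loss with coefficient 0.01.
for step in (1, 10_000, 40_000, 160_000):
    print(step, round(glam_like_lr(step), 5))
```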