Lifelong Language Pretraining with Distribution-Specialized Experts

Authors: Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks. We achieve state-of-the-art decoding scores on downstream one/zero-shot tasks, including the QA task, the translation task, and other language understanding tasks.
Researcher Affiliation | Collaboration | Wuyang Chen (The University of Texas at Austin); Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui (Google). Correspondence to: Yanqi Zhou <yanqiz@google.com>, Nan Du <dunan@google.com>.
Pseudocode | No | The paper describes the method in text and uses diagrams (Figure 1, Figure 2) but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | To simulate the distribution-level lifelong pretraining setting, we build a sequence of billions of tokens that are representative of a wide range of natural language distributions (both English and non-English), based on the GLaM dataset (Du et al., 2022). We collect webpages and Wikipedia pages (with a combination ratio of 81% : 19%, following Du et al., 2022) as our first distribution, denoted as A. i18n (internationalization), the non-English corpus, will be our second distribution B. Finally, the conversations from public-domain social media (Adiwardana et al., 2020) constitute our third distribution C. A toy sketch of this three-phase mixture appears after the table.
Dataset Splits | No | The paper describes pretraining models sequentially on distributions A, B, and C, and monitoring next-token accuracy and perplexity on 'all three distributions throughout all pretraining phases,' but does not specify explicit train/validation/test splits for these pretraining datasets.
Hardware Specification | Yes | The largest Lifelong-MoE model has 1.878B activated parameters with 40 experts (per expert-layer) and is trained on 128 Cloud TPU-V4 chips.
Software Dependencies | No | The paper mentions software components like 'Adafactor' and 'SentencePiece' and data types like 'float32' and 'bfloat16' but does not provide specific version numbers for any software, libraries, or programming languages used.
Experiment Setup | Yes | We use a maximum sequence length of 1024 tokens in each minibatch, and pack input examples so that each batch contains up to 1 million tokens. The dropout rate is set to 0 since the number of available tokens in the training corpus is much greater than the number of processed tokens during training. Our optimizer is Adafactor (Shazeer & Stern, 2018) with first-moment decay β1 = 0, second-moment decay β2 = 0.99 with a 1 − t^(−0.8) decay schedule, an update clipping threshold of 1.0, and factored second-moment estimation. When pretraining on each data distribution, we keep the initial learning rate at 0.01 for the first 10K training steps, and then decay it with an inverse square root schedule, lr_t ∝ 1/√t.
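The Open Datasets row describes a three-phase pretraining stream: distribution A mixes webpages and Wikipedia at 81% : 19%, B is the non-English i18n corpus, and C is public-domain social-media conversation. Below is a minimal sketch of how such a phased mixture could be sampled. The corpora are toy stand-ins (the GLaM-derived sources are not released), and everything except the A → B → C ordering and the 81% : 19% ratio is an illustrative assumption.

```python
import random

# Toy stand-ins for the corpora named in the paper; the real GLaM-derived
# sources are not publicly released.
SOURCES = {
    "web":    ["<webpage document>"],
    "wiki":   ["<wikipedia page>"],
    "i18n":   ["<non-english document>"],
    "dialog": ["<social-media conversation>"],
}

# Phase -> list of (source, sampling probability). Only phase A mixes two
# sources, at the 81% : 19% webpage-to-Wikipedia ratio quoted above.
PHASES = {
    "A": [("web", 0.81), ("wiki", 0.19)],
    "B": [("i18n", 1.0)],
    "C": [("dialog", 1.0)],
}

def sample_document(phase, rng=random):
    """Draw one training document for the given pretraining phase."""
    sources, weights = zip(*PHASES[phase])
    source = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(SOURCES[source])

# Lifelong pretraining visits the distributions sequentially: A, then B, then C.
for phase in ("A", "B", "C"):
    minibatch = [sample_document(phase) for _ in range(4)]
    print(phase, minibatch[0])
```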
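The Experiment Setup row fully specifies the optimizer and learning-rate schedule, so it can be written down directly. The paper does not name its training framework; the optax-based sketch below is an assumption, and only the hyperparameters (β1 = 0, the 1 − t^(−0.8) second-moment decay, clipping threshold 1.0, factored second-moment estimation, and a 0.01 learning rate held for 10K steps followed by inverse-square-root decay) come from the quote.

```python
import jax.numpy as jnp
import optax

INIT_LR = 0.01          # initial learning rate, held for the first 10K steps
CONSTANT_STEPS = 10_000

def lr_schedule(step):
    """Return 0.01 for the first 10K steps, then decay as lr_t ∝ 1/sqrt(t)."""
    t = jnp.maximum(step, CONSTANT_STEPS)
    return INIT_LR * jnp.sqrt(CONSTANT_STEPS / t)

# Adafactor as quoted above: no first moment (beta1 = 0), second-moment decay
# following 1 - t^(-0.8) (optax exposes this as decay_rate=0.8), update
# clipping at 1.0, and factored second-moment estimation.
optimizer = optax.adafactor(
    learning_rate=lr_schedule,
    decay_rate=0.8,
    clipping_threshold=1.0,
    momentum=None,       # beta1 = 0
    factored=True,
)
```

optax treats any callable from step count to learning rate as a schedule, so the constant-then-inverse-square-root rule above can be passed directly as `learning_rate`.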