Improving Language Plasticity via Pretraining with Active Forgetting

Authors: Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Adelani, Pontus Lars Erik Saito Stenetorp, Sebastian Riedel, Mikel Artetxe

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.
Researcher Affiliation | Collaboration | Yihong Chen (UCL Centre for Artificial Intelligence, Meta AI), Kelly Marchisio (Cohere AI), Roberta Raileanu (Meta AI), David Ifeoluwa Adelani (UCL Centre for Artificial Intelligence), Pontus Stenetorp (UCL Centre for Artificial Intelligence), Sebastian Riedel (UCL Centre for Artificial Intelligence), Mikel Artetxe (Reka AI)
Pseudocode | Yes | Algorithm 1: Active forgetting mechanism. The token embedding layer is reset every K updates. (A code sketch of this mechanism follows the table.)
Open Source Code | No | Code will be available at https://github.com/facebookresearch/language-model-plasticity.
Open Datasets | Yes | Our pretraining model is RoBERTa-base, a standard 12-layer transformer-based language model. We trained language-specific SentencePiece tokenizers [Kudo and Richardson, 2018] with a vocabulary size of 50K over the corresponding data subsets in CC-100. The model was pretrained with the English subset of the CC-100 dataset. (A tokenizer-training sketch follows the table.)
Dataset Splits | No | The paper states that models were finetuned on English task data and evaluated on test sets, but it does not provide specific percentages or counts for the training, validation, and test splits of the datasets used in the main text.
Hardware Specification | Yes | Our experiments were implemented using fairseq [Ott et al., 2019]. The pretraining and language adaptation experiments were conducted on 32 Tesla V100 GPUs (each with 32 GB memory) and took approximately 24-36 hours to complete.
Software Dependencies | No | The paper states 'Our experiments were implemented using fairseq [Ott et al., 2019]', but it does not specify software versions for fairseq or any other dependencies.
Experiment Setup | Yes | The pretraining process consists of 125K updates with a batch size of 2048. We used a learning rate scheduler with linear decay and an initial learning rate of 7e-4, with 10K warm-up updates. Checkpoints were saved every 500 updates, and we always chose the last pretraining checkpoint where possible for optimal performance. For forgetting pretraining, we chose the checkpoint corresponding to the best validation perplexity, since the last checkpoint might have its token embeddings reset. We set the frequency of forgetting K = 1000 and used a clip-norm of 0.5. (A configuration sketch follows the table.)
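The Pseudocode row above summarizes Algorithm 1: the token embedding layer is reset every K updates while the rest of the model keeps training. Below is a minimal PyTorch sketch of that idea, assuming a masked-language-modeling model whose forward pass returns a loss; the model, data loader, and Gaussian reset are illustrative placeholders, not the authors' fairseq implementation.

```python
import torch
import torch.nn as nn

K = 1000  # forgetting frequency reported in the paper


def reset_token_embeddings(embedding: nn.Embedding) -> None:
    # One plausible reset scheme (assumption): re-draw the embedding matrix
    # from a small Gaussian, as is common for transformer embeddings.
    nn.init.normal_(embedding.weight, mean=0.0, std=0.02)


def pretrain_with_active_forgetting(model, embedding, data_loader, optimizer,
                                    total_updates=125_000, clip_norm=0.5):
    step = 0
    for batch in data_loader:
        loss = model(**batch).loss          # assumes an MLM-style loss output
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        step += 1
        if step % K == 0:                   # active forgetting: reset embeddings
            reset_token_embeddings(embedding)
        if step >= total_updates:
            break
```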
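For the Open Datasets row, a language-specific tokenizer with a 50K vocabulary can be trained with the SentencePiece library roughly as follows. The file names and the unigram model type are assumptions for illustration, not details taken from the paper.

```python
import sentencepiece as spm

# Train a tokenizer over one CC-100 language subset (placeholder path),
# using the 50K vocabulary size reported in the paper.
spm.SentencePieceTrainer.train(
    input="cc100_subset.txt",          # one sentence per line (placeholder)
    model_prefix="cc100_tokenizer",    # writes cc100_tokenizer.model / .vocab
    vocab_size=50_000,
    model_type="unigram",              # assumption: SentencePiece's default
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="cc100_tokenizer.model")
print(sp.encode("Active forgetting improves language plasticity.", out_type=str))
```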
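The Experiment Setup row can be read as the plain-Python configuration below, together with a warm-up-then-linear-decay schedule matching the description (10K warm-up updates, peak learning rate 7e-4, decay over 125K total updates). This only mirrors the reported hyperparameters; it is not the authors' fairseq configuration.

```python
# Reported pretraining hyperparameters, collected in one place.
PRETRAIN_CONFIG = {
    "total_updates": 125_000,
    "batch_size": 2048,
    "peak_lr": 7e-4,
    "warmup_updates": 10_000,
    "forgetting_interval_K": 1000,
    "clip_norm": 0.5,
    "checkpoint_every": 500,
}


def lr_at(step: int, cfg: dict = PRETRAIN_CONFIG) -> float:
    """Linear warm-up to the peak learning rate, then linear decay to zero."""
    if step < cfg["warmup_updates"]:
        return cfg["peak_lr"] * step / cfg["warmup_updates"]
    decay_span = cfg["total_updates"] - cfg["warmup_updates"]
    remaining = max(cfg["total_updates"] - step, 0)
    return cfg["peak_lr"] * remaining / decay_span
```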