Improving Language Plasticity via Pretraining with Active Forgetting

Authors: Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Adelani, Pontus Lars Erik Saito Stenetorp, Sebastian Riedel, Mikel Artetxe

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.
Researcher Affiliation | Collaboration | Yihong Chen (UCL Centre for Artificial Intelligence, Meta AI), Kelly Marchisio (Cohere AI), Roberta Raileanu (Meta AI), David Ifeoluwa Adelani (UCL Centre for Artificial Intelligence), Pontus Stenetorp (UCL Centre for Artificial Intelligence), Sebastian Riedel (UCL Centre for Artificial Intelligence), Mikel Artetxe (Reka AI)
Pseudocode | Yes | Algorithm 1: Active forgetting mechanism. The token embedding layer is reset every K updates. (A code sketch of this mechanism follows the table.)
Open Source Code | No | Code will be available at https://github.com/facebookresearch/language-model-plasticity.
Open Datasets | Yes | Our pretraining model is RoBERTa-base, a standard 12-layer transformer-based language model. We trained language-specific SentencePiece tokenizers [Kudo and Richardson, 2018] with a vocabulary size of 50K over the corresponding data subsets in CC-100. The model was pretrained with the English subset of the CC-100 dataset. (A tokenizer-training sketch follows the table.)
Dataset Splits | No | The paper states that models were finetuned on English task data and evaluated on test sets, but it does not provide specific percentages or counts for the training, validation, and test splits of the datasets used in the main text.
Hardware Specification | Yes | Our experiments were implemented using fairseq [Ott et al., 2019]. The pretraining and language adaptation experiments were conducted on 32 Tesla V100 GPUs (each with 32 GB memory) and took approximately 24-36 hours to complete.
Software Dependencies | No | The paper states 'Our experiments were implemented using fairseq [Ott et al., 2019]', but it does not specify software versions for fairseq or any other dependencies.
Experiment Setup | Yes | The pretraining process consists of 125K updates with a batch size of 2048. We used a learning rate scheduler with linear decay and an initial learning rate of 7e-4, with 10K warm-up updates. Checkpoints were saved every 500 updates, and we always chose the last pretraining checkpoint where possible for optimal performance. For forgetting pretraining, we chose the checkpoint corresponding to the best validation perplexity, since the last checkpoint might have its token embeddings reset. We set the frequency of forgetting K = 1000 and used a clip-norm of 0.5. (A configuration sketch follows the table.)
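The Pseudocode row above summarizes Algorithm 1: the token embedding layer is reset every K updates while the rest of the model keeps training. Below is a minimal PyTorch sketch of that idea, assuming a masked-language-modeling model whose forward pass returns a loss; the model, data loader, and Gaussian reset are illustrative placeholders, not the authors' fairseq implementation.

```python
import torch
import torch.nn as nn

K = 1000  # forgetting frequency reported in the paper


def reset_token_embeddings(embedding: nn.Embedding) -> None:
    # One plausible reset scheme (assumption): re-draw the embedding matrix
    # from a small Gaussian, as is common for transformer embeddings.
    nn.init.normal_(embedding.weight, mean=0.0, std=0.02)


def pretrain_with_active_forgetting(model, embedding, data_loader, optimizer,
                                    total_updates=125_000, clip_norm=0.5):
    step = 0
    for batch in data_loader:
        loss = model(**batch).loss          # assumes an MLM-style loss output
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        step += 1
        if step % K == 0:                   # active forgetting: reset embeddings
            reset_token_embeddings(embedding)
        if step >= total_updates:
            break
```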
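For the Open Datasets row, a language-specific tokenizer with a 50K vocabulary can be trained with the SentencePiece library roughly as follows. The file names and the unigram model type are assumptions for illustration, not details taken from the paper.

```python
import sentencepiece as spm

# Train a tokenizer over one CC-100 language subset (placeholder path),
# using the 50K vocabulary size reported in the paper.
spm.SentencePieceTrainer.train(
    input="cc100_subset.txt",          # one sentence per line (placeholder)
    model_prefix="cc100_tokenizer",    # writes cc100_tokenizer.model / .vocab
    vocab_size=50_000,
    model_type="unigram",              # assumption: SentencePiece's default
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="cc100_tokenizer.model")
print(sp.encode("Active forgetting improves language plasticity.", out_type=str))
```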
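The Experiment Setup row can be read as the plain-Python configuration below, together with a warm-up-then-linear-decay schedule matching the description (10K warm-up updates, peak learning rate 7e-4, decay over 125K total updates). This only mirrors the reported hyperparameters; it is not the authors' fairseq configuration.

```python
# Reported pretraining hyperparameters, collected in one place.
PRETRAIN_CONFIG = {
    "total_updates": 125_000,
    "batch_size": 2048,
    "peak_lr": 7e-4,
    "warmup_updates": 10_000,
    "forgetting_interval_K": 1000,
    "clip_norm": 0.5,
    "checkpoint_every": 500,
}


def lr_at(step: int, cfg: dict = PRETRAIN_CONFIG) -> float:
    """Linear warm-up to the peak learning rate, then linear decay to zero."""
    if step < cfg["warmup_updates"]:
        return cfg["peak_lr"] * step / cfg["warmup_updates"]
    decay_span = cfg["total_updates"] - cfg["warmup_updates"]
    remaining = max(cfg["total_updates"] - step, 0)
    return cfg["peak_lr"] * remaining / decay_span
```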