MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Authors: Jacob Portes, Alexander Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle

NeurIPS 2023

Reproducibility variables, each listed with its result and the supporting LLM response:

Research Type: Experimental
LLM Response: Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.

Researcher Affiliation: Industry
LLM Response: Jacob Portes [1] (jacob.portes@databricks.com), Alex Trott [1] (alex.trott@databricks.com), Sam Havens [1] (sam.havens@databricks.com), Daniel King [1] (daniel.king@databricks.com), Abhinav Venigalla [1] (abhi@databricks.com), Moin Nadeem [2] (moinnadeem@moinnadeem.com), Nikhil Sardana [1] (nikhil.sardana@databricks.com), Daya Khudia [1] (daya.khudia@databricks.com), Jonathan Frankle [1] (jfrankle@databricks.com). Affiliation: [1] MosaicML (Databricks).

Pseudocode: No
LLM Response: The paper describes architectural modifications graphically (Figure S6) but does not include formal pseudocode or algorithm blocks.

Open Source Code: Yes (see the weight-loading sketch after this list)
LLM Response: We open-source our model weights and code at mosaicbert.github.io. Code for pretraining and finetuning MosaicBERT can be found in the MosaicML examples repository https://github.com/mosaicml/examples. The exact code for this study was pinned to v0.0.4 of the mosaicml/examples repository https://github.com/mosaicml/examples/tree/v0.0.4/examples/bert. All pretraining and finetuning was done in PyTorch 1.13 using the MosaicML Composer library https://github.com/mosaicml/composer. Model weights for MosaicBERT-Base can be found on the Hugging Face hub https://huggingface.co/mosaicml/mosaic-bert-base.

Open Datasets: Yes (C4 loading sketch below)
LLM Response: Here we chose to train all models on the more modern Colossal Cleaned Common Crawl (C4) corpus [46].

Dataset Splits: Yes (split-inspection sketch below)
LLM Response: In our first set of experiments, we pretrained BERT-Base and MosaicBERT-Base for 70,000 steps of batch size 4096... We then finetuned these models on the GLUE benchmark suite using identical finetuning parameters... All evaluation was done on the validation (a.k.a. dev) splits. ...MNLI (Multi-Genre Natural Language Inference) [392,702 train | 19,643 validation | 19,643 test]

Hardware Specification: Yes (cost arithmetic below)
LLM Response: When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20.

Software Dependencies: Yes
LLM Response: All pretraining and finetuning was done in PyTorch 1.13 using the MosaicML Composer library https://github.com/mosaicml/composer.

Experiment Setup: Yes (hyperparameter sketch below)
LLM Response: For all models, we use a global batch size of 4096 and a microbatch size of 128. We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768. These hyperparameters were the same for MosaicBERT-Base and the baseline BERT-Base. More hyperparameter details are included in the Appendix. (See Table S1 for detailed hyperparameters: optimizer, LR, betas, eps, weight decay, microbatch size, warmup, final LR, and MLM masking rate for BERT-Base, MosaicBERT-Base, BERT-Large, and MosaicBERT-Large.)
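
The Hugging Face checkpoint referenced in the Open Source Code row can, in principle, be loaded with the standard transformers Auto classes. The sketch below is not the authors' own loading code: the use of trust_remote_code=True (on the assumption that the checkpoint ships custom modeling code) and the bert-base-uncased tokenizer are assumptions drawn from common practice for this model family.

```python
# Minimal sketch: load the released MosaicBERT-Base weights from the hub
# and fill in a masked token. trust_remote_code=True and the tokenizer
# choice are assumptions, not details stated in the table above.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base", trust_remote_code=True
)

inputs = tokenizer(
    "MosaicBERT is optimized for [MASK] pretraining.", return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the highest-scoring token.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_idx].argmax(dim=-1)))
```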
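
For reference, the C4 corpus named in the Open Datasets row can be streamed with the Hugging Face datasets library. The hub id allenai/c4 and the en subset are assumptions about current hosting; the paper itself only names the C4 corpus [46].

```python
# Minimal sketch: stream a few C4 documents without downloading the corpus.
from datasets import load_dataset

c4_train = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(c4_train):
    print(example["text"][:120])  # each record is one cleaned web document
    if i == 2:
        break
```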
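
The GLUE dev-set evaluation described in the Dataset Splits row can be checked against the Hugging Face hosting of GLUE; the sketch below simply inspects the MNLI splits. MNLI ships matched and mismatched halves for both validation and test (roughly 9.8k examples each).

```python
# Minimal sketch, assuming the Hugging Face `datasets` hosting of GLUE:
# print the MNLI split names and sizes.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
for name, split in mnli.items():
    print(f"{name}: {split.num_rows} examples")
# The paper evaluates on the validation ("dev") splits, i.e.
# validation_matched and validation_mismatched in the case of MNLI.
```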
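
The cost figure in the Hardware Specification row can be sanity-checked with back-of-envelope arithmetic; the implied per-GPU-hour rate below is derived purely from the numbers quoted in the table, not from a published price sheet.

```python
# Back-of-envelope check: 8x A100-80GB for 1.13 hours at roughly $20
# implies a rate of about $2.2 per GPU-hour.
num_gpus = 8
hours = 1.13
total_cost_usd = 20.0

gpu_hours = num_gpus * hours                      # ~9.0 GPU-hours
usd_per_gpu_hour = total_cost_usd / gpu_hours
print(f"{gpu_hours:.2f} GPU-hours -> ~${usd_per_gpu_hour:.2f} per A100-hour")
```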
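
Finally, the pretraining hyperparameters stated in the Experiment Setup row are collected below as a plain configuration dict. Only values quoted in this table are filled in; the optimizer, learning-rate schedule, weight decay, warmup, and MLM masking rate come from Table S1 of the appendix and are left out rather than guessed. In the MosaicML Composer library these values would map onto the Trainer's batch-size, microbatch, and duration settings; treating the 128-example microbatch as a per-device quantity is an assumption, not a statement from the paper.

```python
# Stated pretraining setup only; Table S1 values are intentionally omitted.
pretrain_config = {
    "global_batch_size": 4096,
    "microbatch_size": 128,   # assumed per device (cf. Composer's device_train_microbatch_size)
    "max_seq_len": 128,       # pretraining sequence length
    "hidden_size": 768,       # standard BERT-Base embedding dimension
    "train_steps": 70_000,    # from the Dataset Splits row above
}

# Implied gradient accumulation on the 8 A100s from the hardware row,
# if the microbatch is indeed per device:
num_gpus = 8
accum = pretrain_config["global_batch_size"] // (num_gpus * pretrain_config["microbatch_size"])
print(f"{accum} microbatches accumulated per device per optimizer step")  # -> 4
```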