Scaling Laws for Generative Mixed-Modal Language Models

Authors: Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Researcher Affiliation | Collaboration | FAIR, YerevaNN, University of Washington.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | All models were trained using the metaseq code base, which includes an implementation of causal masking (Zhang et al., 2022). https://github.com/facebookresearch/metaseq
Open Datasets | Yes | Text: For our text corpus, we use the same data as was used in OPT (Zhang et al., 2022), for a total of 180B tokens. Image: For all images, we convert them to discrete tokens using the Make-A-Scene visual tokenizer (Gafni et al., 2022), which gives 1024 tokens from an 8192 vocabulary per image. We select a custom subset of 600 million images across Schuhmann et al. (2022) and a custom image-text dataset scraped from Common Crawl. Speech: We used a combination of custom web-mined speech data and unlabeled speech in several public datasets. Our public data collection covers various speech styles and content topics, including LibriSpeech (read books), Common Voice (read Wiki), VoxPopuli (parliament domain), and Spotify Podcasts and People's Speech (web speech). Speech-Text: Many public datasets also come with text aligned with speech. We take ASR and TTS data from Multilingual LibriSpeech and VoxPopuli to form the Speech-Text dataset. Code: We use the InCoder data (Fried et al., 2022). Molecules: We utilize the Simplified Molecular Input Line Entry System (SMILES, where the chemical's structure is serialized into a string of symbols) representation from the ZINC dataset prepared by Chilingaryan et al. (2022).
Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, validation, and test splits for the main model training. While it mentions a 'validation split' in the context of image tokenization benchmarking, this does not describe the overall model training splits.
Hardware Specification | Yes | All experiments were conducted in a two-month time frame with a cluster of 768 80GB A100 GPUs. The majority of experiments used 64 GPUs at a time.
Software Dependencies | No | The training used the PyTorch framework (Paszke et al., 2019), with fairscale to improve memory efficiency through fully sharded model and optimizer states (Baines et al., 2021). The training also uses Megatron-LM tensor parallelism (Shoeybi et al., 2019) to support large model runs, and we use bf16 (Kalamkar et al., 2019) to improve training stability. We tracked all experiments using the Aim experiment tracker (Arakelyan et al., 2020). The paper names this software but does not specify version numbers for PyTorch, fairscale, Megatron-LM, or Aim.
Experiment Setup | Yes | The batch size per GPU was determined based on the total world size of the experiment, the level of model parallelism, and the total target batch size in terms of the number of tokens. To ensure stable training, we applied gradient clipping with a maximum norm of 1.0 and used the Adam optimizer with β1 = 0.9, β2 = 0.98 (Kingma & Ba, 2015). We used the built-in polynomial decay learning rate scheduler in metaseq with 500 warmup updates and the end learning rate set to 10% of the peak learning rate. Additionally, Table 1 provides details such as 'Batch Size' and 'LR' for various model sizes (e.g., '1M' for batch size and '1.00E-03' for LR). (A minimal configuration sketch based on these values follows the table.)
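
The optimizer and schedule values quoted in the Experiment Setup row translate into a short PyTorch sketch. This is a minimal illustration, not metaseq's implementation: the decay power (1.0), peak learning rate, total update count, stand-in model, and placeholder loss are assumptions chosen only to make the snippet runnable, while the Adam betas, max gradient norm of 1.0, 500 warmup updates, and 10% end-LR ratio come from the quoted setup.

```python
# Sketch of the quoted training configuration: Adam (beta1=0.9, beta2=0.98),
# gradient clipping at max-norm 1.0, 500 warmup updates, and decay to 10% of
# the peak LR. A plain LambdaLR stands in for metaseq's built-in polynomial
# decay scheduler; peak_lr, total_updates, decay power, model, and loss are
# illustrative assumptions.
import torch

model = torch.nn.Linear(512, 512)            # stand-in for the real model
peak_lr, end_ratio = 1e-3, 0.10              # end LR = 10% of peak (assumed peak)
warmup, total_updates, power = 500, 100_000, 1.0

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98))

def lr_lambda(step: int) -> float:
    if step < warmup:                        # linear warmup to the peak LR
        return step / max(1, warmup)
    # polynomial decay from 1.0 down to end_ratio over the remaining updates
    progress = min(1.0, (step - warmup) / max(1, total_updates - warmup))
    return end_ratio + (1.0 - end_ratio) * (1.0 - progress) ** power

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch: torch.Tensor) -> None:
    loss = model(batch).pow(2).mean()        # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

training_step(torch.randn(8, 512))           # single illustrative update
```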
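
For context on the scaling-law claim quoted in the Research Type row, the sketch below fits a standard Chinchilla-style unimodal law L(N, D) = E + A/N^alpha + B/D^beta with scipy. The (N, D, loss) points and initial guesses are hypothetical values invented for illustration, and the paper's mixed-modal laws additionally model interactions between modality pairs, which this sketch does not attempt to reproduce.

```python
# Minimal sketch: fit a Chinchilla-style per-modality scaling law
#   L(N, D) = E + A / N**alpha + B / D**beta
# to hypothetical (parameters, tokens, loss) measurements. The paper's
# mixed-modal laws add interaction terms between modalities that are
# NOT modeled here; this only illustrates the unimodal building block.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    N, D = x  # N: model parameters, D: training tokens
    return E + A / N**alpha + B / D**beta

# Hypothetical measurements, not values from the paper.
N = np.array([8e6, 1.25e8, 3.5e8, 1.3e9, 6.7e9, 3.0e10])
D = np.array([5e9, 1e10, 2.5e10, 5e10, 1e11, 1e11])
loss = np.array([4.08, 3.37, 3.03, 2.80, 2.61, 2.57])

# p0 is a rough initial guess for (E, A, alpha, B, beta).
popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[2.0, 100.0, 0.3, 1000.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"E={E:.3f}, A={A:.3g}, alpha={alpha:.3f}, B={B:.3g}, beta={beta:.3f}")
```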