Scaling Laws for Multilingual Neural Machine Translation

Authors: Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior. For our analysis, we train over 200 MNMT models (ranging from 20M to 1B non-embedding parameters) and systematically examine their scaling behaviors. (A sketch of fitting such a scaling curve follows the table.)
Researcher Affiliation | Collaboration | 1 Google Research, 2 Carnegie Mellon University, 3 Instituto Superior Técnico. Correspondence to: Patrick Fernandes <pfernand@cs.cmu.edu>.
Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. Procedures are described in natural language.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. There are no links to repositories or explicit statements about code release.
Open Datasets | Yes | For out-of-domain, we use newstest2019 (Barrault et al., 2019), consisting of 2000 sentence-pairs extracted from aligned news documents.
Dataset Splits | No | The paper explicitly mentions test sets but does not specify a validation split (e.g., percentages or counts for training, validation, and test sets, or a dedicated validation set). It mentions using an "in-house web-crawled dataset" for training the models and extracting 2000 sentences for an in-domain test set, but gives no explicit validation split.
Hardware Specification | No | The paper does not specify the hardware used for its experiments (e.g., specific GPU models, CPU types, or cloud instance specifications). It mentions training budgets (batch size and number of gradient steps) but not the accelerators these were run on.
Software Dependencies | No | The paper mentions specific software components such as the "pre-LN encoder-decoder Transformer architecture", the "Adafactor optimizer", and "multilingual Sentence Piece", but does not provide version numbers for these, which would be needed for reproducibility.
Experiment Setup | Yes | We train models of up to 8 sizes, approximately ranging from 20M to 1B (non-embedding) parameters. We use the pre-LN encoder-decoder Transformer architecture in our models. The models are trained with per-token cross-entropy loss and the Adafactor optimizer (Shazeer & Stern, 2018), using a fixed batch size of 500K tokens and an inverse square root learning rate schedule. In practice, this translates to training our smaller models (< 500M parameters) for 500K gradient steps and larger models for 1M steps. We tokenize this corpus using a pretrained multilingual Sentence Piece (Kudo, 2018) vocabulary with a size of 128K sub-words. Each observation in the training batch is chosen from the first language pair with probability p and the second language pair with probability 1 − p. For our experiments, we choose p from the set {0, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1}. Appendix A details model sizes and hyperparameters like encoder/decoder layers, embedding dimension, number of heads, and MLP dimension. (A sampling and learning-rate sketch follows the table.)
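
The Research Type row above describes fitting scaling behavior to models ranging from 20M to 1B non-embedding parameters. The following is a minimal sketch of how such a parameter-count scaling curve could be fit, assuming the commonly used saturating power-law form L(N) = β·N^(−α) + L∞; the functional form and all numbers below are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: fit L(N) = beta * N**(-alpha) + L_inf to held-out
# cross-entropy losses of models with different non-embedding parameter counts.
# The functional form and the data points are assumptions, not paper results.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(n_params, beta, alpha, l_inf):
    """Saturating power law in the non-embedding parameter count."""
    return beta * n_params ** (-alpha) + l_inf

# Hypothetical (model size, held-out loss) measurements, 20M to 1B parameters.
n_params = np.array([2e7, 5e7, 1e8, 2e8, 4e8, 7e8, 1e9])
losses = np.array([3.10, 2.85, 2.70, 2.58, 2.49, 2.44, 2.41])

(beta, alpha, l_inf), _ = curve_fit(
    scaling_curve, n_params, losses, p0=[100.0, 0.3, 2.3], maxfev=10000
)
print(f"beta={beta:.3g}, alpha={alpha:.3g}, irreducible loss={l_inf:.3g}")
```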
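
The Experiment Setup row mentions two procedural details that a short sketch can make concrete: drawing each training example from the first language pair with probability p (and the second with probability 1 − p), and an inverse square root learning-rate schedule. The warmup length and peak learning rate below are assumptions, since the section does not report them.

```python
# Minimal sketch of two ingredients from the Experiment Setup row:
# (1) sampling each training example from the first language pair with
#     probability p and from the second with probability 1 - p, and
# (2) an inverse square root learning-rate schedule.
# Warmup steps and peak learning rate are hypothetical constants.
import math
import random

P_GRID = [0.0, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0]  # mixture weights studied

def sample_language_pair(p: float, rng: random.Random) -> str:
    """Pick the language pair the next training example is drawn from."""
    return "pair_1" if rng.random() < p else "pair_2"

def inverse_sqrt_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 10_000) -> float:
    """Linear warmup followed by inverse square root decay (assumed constants)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

rng = random.Random(0)
counts = {"pair_1": 0, "pair_2": 0}
for _ in range(1000):
    counts[sample_language_pair(0.7, rng)] += 1
print(counts)                    # roughly 700 / 300 for p = 0.7
print(inverse_sqrt_lr(500_000))  # learning rate late in training
```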