Scaling Laws for Neural Machine Translation

Authors: Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an empirical study of the scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT).
Researcher Affiliation | Industry | Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba & Colin Cherry, Google AI
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper states 'We release generated text from all models used in this study.' but does not mention releasing the source code for the methodology.
Open Datasets | No | We use in-house web-crawled training datasets with around 2.2 billion sentence pairs (approximately 55 billion tokens) for English-German and 781 million sentence pairs for English-Chinese.
Dataset Splits | No | The paper mentions 'validation data' and 'test sets' used for evaluation during training, but does not specify split percentages, sample counts, or how the validation and test sets were partitioned from the main training corpus. It reports the size of the test sets, but not how they were derived from the overall dataset.
Hardware Specification | Yes | We focus our study on large-scale models: our smallest models require 200 TPUv3 days to train to convergence while our largest models require 2700 TPUv3 days of training.
Software Dependencies | No | The paper mentions the Adafactor optimizer (Shazeer & Stern, 2018) and the Moses scorer but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | Models are trained with per-token cross-entropy loss and the Adafactor optimizer (Shazeer & Stern, 2018). All models are trained with a fixed batch size of 500k tokens and a dropout rate of 0.1 for residuals, feed-forward activations and attention. All models are trained to near convergence for 500k training steps. We use a sentence-piece vocabulary of size 32000. Regularization: We use a dropout of 0.1 for residuals, feed-forward activations and attention. Models are trained with label smoothing of magnitude 0.1. To improve training stability, all models use logit clipping of 10. Optimizer: We use the Adafactor (Shazeer & Stern, 2018) optimizer for training our models. We use 40k linear warm-up steps and an inverse square root learning rate schedule. For Adafactor we used momentum of 0.9 and a factored second moment to save memory.
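The Experiment Setup row above pins down the optimization recipe: Adafactor, 500k-token batches, 500k training steps, and 40k linear warm-up steps followed by an inverse square-root learning-rate decay. The sketch below (Python, not the authors' code) collects the reported hyperparameters and shows the schedule shape; the peak learning rate PEAK_LR is a hypothetical placeholder, since the excerpt does not state it.

```python
# Minimal sketch of the training schedule described in the Experiment Setup row.
# Hyperparameters are quoted from the paper's setup; PEAK_LR is an assumption.

WARMUP_STEPS = 40_000        # linear warm-up steps
TRAIN_STEPS = 500_000        # total training steps
BATCH_SIZE_TOKENS = 500_000  # fixed batch size in tokens
DROPOUT = 0.1                # residuals, feed-forward activations, attention
LABEL_SMOOTHING = 0.1
LOGIT_CLIP = 10.0
VOCAB_SIZE = 32_000          # sentence-piece vocabulary
ADAFACTOR_MOMENTUM = 0.9     # with factored second moment to save memory
PEAK_LR = 1.0                # hypothetical placeholder, not given in the excerpt


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then inverse square-root decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5


if __name__ == "__main__":
    for step in (1_000, 40_000, 160_000, TRAIN_STEPS):
        print(f"step {step:>7}: lr = {learning_rate(step):.4f}")
```

The two branches meet at the 40k-step boundary (both evaluate to PEAK_LR there), after which the rate falls as the square root of the warm-up-to-step ratio, the standard inverse square-root schedule used with Adafactor-style optimizers.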