Scaling Laws for Neural Machine Translation
Authors: Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an empirical study of the scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). |
| Researcher Affiliation | Industry | Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba & Colin Cherry (Google AI) |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper states 'We release generated text from all models used in this study' but does not mention releasing the source code for the methodology. |
| Open Datasets | No | We use in-house web-crawled training datasets with around 2.2 billion sentence pairs (approximately 55 billion tokens) for English→German and 781 million sentence pairs for English→Chinese. |
| Dataset Splits | No | The paper mentions 'validation data' and 'test sets' used for evaluation during training, but does not specify split percentages, sample counts, or the methodology for partitioning the corpus into train/validation/test sets. It reports the size of the test sets, but not how they were drawn from the overall dataset. |
| Hardware Specification | Yes | We focus our study on large-scale models: our smallest models require 200 TPUv3 days to train to convergence while our largest models require 2700 TPUv3 days of training. |
| Software Dependencies | No | The paper mentions 'Adafactor optimizer (Shazeer & Stern, 2018)' and 'Moses scorer' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Models are trained with a per-token cross-entropy loss and the Adafactor optimizer (Shazeer & Stern, 2018). All models use a fixed batch size of 500k tokens and a sentence-piece vocabulary of size 32000, and are trained to near convergence for 500k training steps. Regularization: a dropout of 0.1 is applied to residuals, feed-forward activations and attention, label smoothing of magnitude 0.1 is used, and, to improve training stability, all models use logit clipping of 10. Optimizer: Adafactor (Shazeer & Stern, 2018) with 40k linear warm-up steps and an inverse square root learning rate schedule, momentum of 0.9, and a factored second moment to save memory. |
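
The learning-rate schedule quoted in the Experiment Setup row (40k linear warm-up steps followed by inverse square root decay) can be illustrated with the minimal sketch below. This is not the authors' code: the peak learning rate `PEAK_LR` is a hypothetical placeholder, since the quoted setup does not state its value.

```python
# Minimal sketch of the schedule described above: linear warm-up for 40k steps,
# then decay proportional to 1/sqrt(step). PEAK_LR is an assumed placeholder.
import math

WARMUP_STEPS = 40_000   # "40k linear warm-up steps" (quoted from the paper)
PEAK_LR = 1e-3          # hypothetical; the paper does not give the peak value

def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then inverse-square-root decay."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)

# Example values across the 500k-step training run described above.
for s in (1_000, 40_000, 100_000, 500_000):
    print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```

The other quoted hyperparameters (batch size of 500k tokens, dropout and label smoothing of 0.1, logit clipping of 10) would be configured separately in the training setup and are not covered by this sketch.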