Data Scaling Laws in NMT: The Effect of Noise and Architecture
Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Colin Cherry, Behnam Neyshabur, Orhan Firat
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. (See the power-law fitting sketch after the table.) |
| Researcher Affiliation | Collaboration | ¹School of Engineering and Applied Science, Harvard University, MA, USA (work performed while interning at Google); ²Google, USA; ³School of Informatics, University of Edinburgh. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. The methodology is described in prose. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository is provided in the paper. |
| Open Datasets | Yes | The second set of experiments are conducted with Paracrawl dataset (Bañón et al., 2020), both with and without filtering applied. |
| Dataset Splits | No | For the small dataset sizes, the models are trained to early stopping (as measured on the log-perplexity of a held-out development set) and for large dataset sizes they are trained for up to 500K gradient steps. |
| Hardware Specification | Yes | In all, we produce 20 different data scaling curves, each consisting of 10 different dataset sizes which take 25K TPUv3-hours to produce. |
| Software Dependencies | No | Models are trained with per-token cross-entropy loss and Adafactor optimizer (Shazeer & Stern, 2018). |
| Experiment Setup | Yes | All models are trained with a fixed batch-size of 500K tokens and dropout rate of 0.1 for residuals, feed-forward activations and attention. For the small dataset sizes, the models are trained to early stopping (as measured on the log-perplexity of a held-out development set) and for large dataset sizes they are trained for up to 500K gradient steps. (See the training-regime sketch after the table.) |
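
The Research Type row quotes the paper's central claim: test loss scales as a power law in the number of training samples, with a dependence on model size. As a rough illustration of what fitting such a data scaling curve involves, the sketch below fits a saturating power law L(D) = β·D^(−α) + L∞ to synthetic (dataset size, loss) points with `scipy.optimize.curve_fit`; the functional form, the synthetic values, and every name in the snippet are assumptions for illustration, not the paper's exact parameterization or measurements.

```python
# Minimal sketch: fit a saturating power law L(D) = beta * D**(-alpha) + L_inf
# to (training-set size, test loss) points. Synthetic data and an assumed
# functional form -- not the paper's measurements or exact parameterization.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, beta, alpha, L_inf):
    """Test loss as a function of the number of training samples D."""
    return beta * D ** (-alpha) + L_inf

# Synthetic "observations": losses drawn from a known law plus a little noise.
rng = np.random.default_rng(0)
D_obs = np.logspace(5, 8, 10)                          # 1e5 .. 1e8 samples
L_obs = scaling_law(D_obs, 80.0, 0.33, 2.7) + rng.normal(0.0, 0.02, D_obs.size)

# Fit the three parameters; bounds keep the exponent and losses non-negative.
popt, _ = curve_fit(
    scaling_law, D_obs, L_obs,
    p0=[50.0, 0.3, 2.5],
    bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, np.inf]),
)
beta_hat, alpha_hat, L_inf_hat = popt
print(f"beta={beta_hat:.3g}  alpha={alpha_hat:.3f}  irreducible loss={L_inf_hat:.3f}")
```

In the paper's setting, the interest is in how such fitted scaling parameters shift as the architecture and training-data quality are varied.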
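
The Dataset Splits and Experiment Setup rows describe the training regime: fixed 500K-token batches, dropout of 0.1, per-token cross-entropy with Adafactor, early stopping on held-out dev log-perplexity for small dataset sizes, and a 500K-gradient-step cap for large ones. The skeleton below sketches only that control flow; `train_step`, `eval_dev_log_perplexity`, the evaluation interval, and the patience value are placeholders, not the authors' code or stated settings.

```python
# Skeleton of the training regime described in the table: fixed 500K-token
# batches, dropout 0.1, Adafactor with per-token cross-entropy, early stopping
# on held-out dev log-perplexity for small datasets, otherwise a 500K-step cap.
# `train_step` / `eval_dev_log_perplexity` are hypothetical callables.
from typing import Callable

CONFIG = {
    "batch_size_tokens": 500_000,   # fixed batch size in tokens (from the paper)
    "dropout": 0.1,                 # residual / feed-forward / attention dropout
    "max_steps": 500_000,           # step cap used for the large dataset sizes
    "eval_every": 1_000,            # assumed evaluation interval (not stated)
    "patience": 10,                 # assumed early-stopping patience (not stated)
}

def train(train_step: Callable[[], None],
          eval_dev_log_perplexity: Callable[[], float],
          use_early_stopping: bool) -> float:
    """Run training until early stopping (small data) or the step cap (large data)."""
    best_dev, evals_since_best = float("inf"), 0
    for step in range(1, CONFIG["max_steps"] + 1):
        train_step()  # one Adafactor update on a 500K-token batch (per-token CE loss)
        if step % CONFIG["eval_every"] == 0:
            dev = eval_dev_log_perplexity()
            if dev < best_dev:
                best_dev, evals_since_best = dev, 0
            else:
                evals_since_best += 1
            if use_early_stopping and evals_since_best >= CONFIG["patience"]:
                break  # dev log-perplexity stopped improving
    return best_dev
```

For example, `train(my_step, my_eval, use_early_stopping=True)` (with hypothetical `my_step` / `my_eval` callables) corresponds to the small-data behaviour, while `use_early_stopping=False` corresponds to training up to the 500K-step cap used for the large dataset sizes.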