Data Scaling Laws in NMT: The Effect of Noise and Architecture
Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Colin Cherry, Behnam Neyshabur, Orhan Firat
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. (See the power-law fitting sketch after the table.) |
| Researcher Affiliation | Collaboration | ¹School of Engineering and Applied Science, Harvard University, MA, USA (work performed while interning at Google); ²Google, USA; ³School of Informatics, University of Edinburgh. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. The methodology is described in prose. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository is provided in the paper. |
| Open Datasets | Yes | The second set of experiments are conducted with Paracrawl dataset (Bañón et al., 2020), both with and without filtering applied. |
| Dataset Splits | No | For the small dataset sizes, the models are trained to early stopping (as measured on the log-perplexity of a held-out development set) and for large dataset sizes they are trained for up to 500K gradient steps. |
| Hardware Specification | Yes | In all, we produce 20 different data scaling curves, each consisting of 10 different dataset sizes which take 25K TPUv3-hours to produce. |
| Software Dependencies | No | Models are trained with per-token cross-entropy loss and Adafactor optimizer (Shazeer & Stern, 2018). |
| Experiment Setup | Yes | All models are trained with a fixed batch-size of 500K tokens and dropout rate of 0.1 for residuals, feed-forward activations and attention. For the small dataset sizes, the models are trained to early stopping (as measured on the log-perplexity of a held-out development set) and for large dataset sizes they are trained for up to 500K gradient steps. (See the training-regime sketch after the table.) |
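
The Research Type row quotes the paper's central claim: test loss scales as a power law in the number of training samples, with a dependence on model size. As a rough illustration of what fitting such a data scaling curve involves, the sketch below fits a saturating power law L(D) = β·D^(−α) + L∞ to synthetic (dataset size, loss) points with `scipy.optimize.curve_fit`; the functional form, the synthetic values, and every name in the snippet are assumptions for illustration, not the paper's exact parameterization or measurements.

```python
# Minimal sketch: fit a saturating power law L(D) = beta * D**(-alpha) + L_inf
# to (training-set size, test loss) points. Synthetic data and an assumed
# functional form -- not the paper's measurements or exact parameterization.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, beta, alpha, L_inf):
    """Test loss as a function of the number of training samples D."""
    return beta * D ** (-alpha) + L_inf

# Synthetic "observations": losses drawn from a known law plus a little noise.
rng = np.random.default_rng(0)
D_obs = np.logspace(5, 8, 10)                          # 1e5 .. 1e8 samples
L_obs = scaling_law(D_obs, 80.0, 0.33, 2.7) + rng.normal(0.0, 0.02, D_obs.size)

# Fit the three parameters; bounds keep the exponent and losses non-negative.
popt, _ = curve_fit(
    scaling_law, D_obs, L_obs,
    p0=[50.0, 0.3, 2.5],
    bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, np.inf]),
)
beta_hat, alpha_hat, L_inf_hat = popt
print(f"beta={beta_hat:.3g}  alpha={alpha_hat:.3f}  irreducible loss={L_inf_hat:.3f}")
```

In the paper's setting, the interest is in how such fitted scaling parameters shift as the architecture and training-data quality are varied.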
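
The Dataset Splits and Experiment Setup rows describe the training regime: fixed 500K-token batches, dropout of 0.1, per-token cross-entropy with Adafactor, early stopping on held-out dev log-perplexity for small dataset sizes, and a 500K-gradient-step cap for large ones. The skeleton below sketches only that control flow; `train_step`, `eval_dev_log_perplexity`, the evaluation interval, and the patience value are placeholders, not the authors' code or stated settings.

```python
# Skeleton of the training regime described in the table: fixed 500K-token
# batches, dropout 0.1, Adafactor with per-token cross-entropy, early stopping
# on held-out dev log-perplexity for small datasets, otherwise a 500K-step cap.
# `train_step` / `eval_dev_log_perplexity` are hypothetical callables.
from typing import Callable

CONFIG = {
    "batch_size_tokens": 500_000,   # fixed batch size in tokens (from the paper)
    "dropout": 0.1,                 # residual / feed-forward / attention dropout
    "max_steps": 500_000,           # step cap used for the large dataset sizes
    "eval_every": 1_000,            # assumed evaluation interval (not stated)
    "patience": 10,                 # assumed early-stopping patience (not stated)
}

def train(train_step: Callable[[], None],
          eval_dev_log_perplexity: Callable[[], float],
          use_early_stopping: bool) -> float:
    """Run training until early stopping (small data) or the step cap (large data)."""
    best_dev, evals_since_best = float("inf"), 0
    for step in range(1, CONFIG["max_steps"] + 1):
        train_step()  # one Adafactor update on a 500K-token batch (per-token CE loss)
        if step % CONFIG["eval_every"] == 0:
            dev = eval_dev_log_perplexity()
            if dev < best_dev:
                best_dev, evals_since_best = dev, 0
            else:
                evals_since_best += 1
            if use_early_stopping and evals_since_best >= CONFIG["patience"]:
                break  # dev log-perplexity stopped improving
    return best_dev
```

For example, `train(my_step, my_eval, use_early_stopping=True)` (with hypothetical `my_step` / `my_eval` callables) corresponds to the small-data behaviour, while `use_early_stopping=False` corresponds to training up to the 500K-step cap used for the large dataset sizes.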