Examining Scaling and Transfer of Language Model Architectures for Machine Translation

Authors: Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that: (i) Different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases, (ii) Several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality, and (iii) When paired with full-visible masking for source sequences, LMs could perform on par with Enc-Dec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations. (A minimal sketch contrasting causal and full-visible source masking follows the table.)
Researcher Affiliation | Collaboration | School of Informatics, University of Edinburgh; Google Research. Correspondence to: Biao Zhang <b.zhang@ed.ac.uk>, Orhan Firat <orhanf@google.com>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing open-source code or links to a code repository for the described methodology.
Open Datasets | Yes | We use WMT14 English-French (En-Fr), WMT14 English-German (En-De), WMT19 English-Chinese (En-Zh) and an in-house web-crawled (Web) En-De dataset for experiments, whose statistics are summarized in Table 2. We also report results on OPUS-100 (Zhang et al., 2020), a massively multilingual corpus containing 100 languages. (A data-loading sketch for the public corpora follows the table.)
Dataset Splits | Yes | Table 2: Statistics of different datasets. M/B: million/billion; SO/TO: source-original/target-original test sets; Web: in-house web-crawled datasets; BIL/MUL: the data is used for bilingual/multilingual experiments. WMT14 En-De: 4.5M train, 3000 dev (WMT13), 3003 test (WMT14). WMT14 En-Fr: 41M train, 3000 dev (WMT13), 3003 test (WMT14). WMT19 En-Zh: 26M train, 3981 dev (WMT18), 1997 test (WMT19, SO) and 2000 test (WMT19, TO).
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU/CPU models, memory details).
Software Dependencies | No | The paper mentions 'Adafactor' and 'SentencePiece' but does not provide specific version numbers for these or other software dependencies. (A SentencePiece usage sketch follows the table.)
Experiment Setup | Yes | We use Transformer for experiments. By default, we adopt the base setting, with d = 512, d_ff = 2048 and 8 attention heads. We also work with the Transformer big setting where each hyper-parameter above is doubled. ... We update model parameters via Adafactor (Shazeer & Stern, 2018) with label smoothing of value 0.1, and scheduled learning rate of warmup steps 40K. We apply dropout of 0.1 to residuals, feed-forward activations and attentions. We employ the post-norm Transformer by default; for some exceptional cases (often with deep models where training is unstable) we use the pre-norm one instead. Batch size is set to about 128K tokens. We train models for up to 1M steps on different tasks, except Web En-De where 500K steps is used. We average 10 checkpoints for evaluation. ... Beam search is used for inference, with a beam size of 8 and length penalty of 0.5. (The quoted hyper-parameters are collected into a configuration sketch after the table.)
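
The distinction in the Research Type row between causal masking and full-visible masking over the source sequence can be illustrated with a minimal NumPy sketch. The function name and the tiny sequence lengths are illustrative assumptions, not code from the paper.

```python
import numpy as np

def lm_self_attention_mask(src_len: int, tgt_len: int, full_visible_source: bool) -> np.ndarray:
    """Boolean mask over the concatenated source+target sequence.

    mask[i, j] is True when position i may attend to position j.
    full_visible_source=False -> plain causal LM masking everywhere.
    full_visible_source=True  -> bidirectional attention within the source
                                 prefix, causal attention over the target.
    """
    total = src_len + tgt_len
    mask = np.tril(np.ones((total, total), dtype=bool))  # causal baseline
    if full_visible_source:
        mask[:src_len, :src_len] = True  # source tokens see the whole source
    return mask

# 3 source tokens, 2 target tokens.
print(lm_self_attention_mask(3, 2, full_visible_source=False).astype(int))
print(lm_self_attention_mask(3, 2, full_visible_source=True).astype(int))
```

With full_visible_source=True the top-left source block becomes all ones, which is the prefix-style configuration the abstract reports as performing on par with Enc-Dec.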
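
The public corpora named under Open Datasets can be fetched through standard channels; the sketch below uses Hugging Face `datasets` hub identifiers as an assumption, since the paper does not say how the data was obtained, and the in-house Web En-De set is not publicly available.

```python
from datasets import load_dataset

# Hub configs assumed; approximate training sizes taken from Table 2.
wmt14_en_de = load_dataset("wmt14", "de-en")    # ~4.5M sentence pairs
wmt14_en_fr = load_dataset("wmt14", "fr-en")    # ~41M sentence pairs
wmt19_en_zh = load_dataset("wmt19", "zh-en")    # ~26M sentence pairs
opus100_de  = load_dataset("opus100", "de-en")  # one of the 100 OPUS-100 pairs

for name, ds in [("WMT14 En-De", wmt14_en_de), ("WMT14 En-Fr", wmt14_en_fr),
                 ("WMT19 En-Zh", wmt19_en_zh)]:
    print(name, {split: ds[split].num_rows for split in ds})
```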
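
Since SentencePiece is named under Software Dependencies without a version, the snippet below shows the generic Python API for training and applying a subword model; the file names and the 32K vocabulary size are assumptions, not values taken from the paper.

```python
import sentencepiece as spm

# Train a subword model on raw parallel text (file name and vocab size assumed).
spm.SentencePieceTrainer.train(
    input="train.en-de.txt", model_prefix="spm_ende", vocab_size=32000)

sp = spm.SentencePieceProcessor(model_file="spm_ende.model")
pieces = sp.encode("Scaling laws for machine translation.", out_type=str)
ids = sp.encode("Scaling laws for machine translation.", out_type=int)
print(pieces, ids)
```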
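
The hyper-parameters quoted under Experiment Setup can be collected into a single configuration sketch. The inverse-square-root schedule shape and the peak learning rate are assumptions; the paper only states the optimizer, warmup length, and regularization values.

```python
TRANSFORMER_BASE = dict(d_model=512, d_ff=2048, num_heads=8)
TRANSFORMER_BIG = {k: 2 * v for k, v in TRANSFORMER_BASE.items()}  # "each hyper-parameter doubled"

TRAIN_CONFIG = dict(
    optimizer="Adafactor",          # Shazeer & Stern (2018)
    label_smoothing=0.1,
    warmup_steps=40_000,
    dropout=0.1,                    # residuals, feed-forward activations, attention
    batch_size_tokens=128_000,
    max_steps=1_000_000,            # 500K for Web En-De
    checkpoints_averaged=10,
    layer_norm="post-norm",         # pre-norm for unstable deep models
)
DECODE_CONFIG = dict(beam_size=8, length_penalty=0.5)

def learning_rate(step: int, warmup: int = 40_000, peak: float = 1e-3) -> float:
    """Linear warmup to `peak`, then inverse-square-root decay (assumed shape)."""
    step = max(step, 1)
    return peak * min(step / warmup, (warmup / step) ** 0.5)

print(TRANSFORMER_BIG, learning_rate(40_000))
```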