Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Authors: Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that: (i) different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases; (ii) several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality; and (iii) when paired with full-visible masking for source sequences, LMs could perform on par with Enc-Dec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations. |
| Researcher Affiliation | Collaboration | (1) School of Informatics, University of Edinburgh; (2) Google Research. Correspondence to: Biao Zhang <b.zhang@ed.ac.uk>, Orhan Firat <orhanf@google.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing open-source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We use WMT14 English-French (En-Fr), WMT14 English-German (En-De), WMT19 English-Chinese (En-Zh) and an in-house web-crawled (Web) En-De dataset for experiments, whose statistics are summarized in Table 2. We also report results on OPUS-100 (Zhang et al., 2020), a massively multilingual corpus containing 100 languages. |
| Dataset Splits | Yes | Table 2 (statistics of different datasets; M/B = million/billion, SO/TO = source-original/target-original test sets, Web = in-house web-crawled datasets, BIL/MUL = data used for bilingual/multilingual experiments) lists the train/dev/test sample counts and sources per dataset: WMT14 En-De: 4.5M train, 3000 dev (WMT13), 3003 test (WMT14); WMT14 En-Fr: 41M train, 3000 dev (WMT13), 3003 test (WMT14); WMT19 En-Zh: 26M train, 3981 dev (WMT18), 1997 test (WMT19, SO) + 2000 test (WMT19, TO). These splits are restated as a data structure in the sketch below the table. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU/CPU models, memory details). |
| Software Dependencies | No | The paper mentions 'Adafactor' and 'SentencePiece' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use Transformer for experiments. By default, we adopt the base setting, with d = 512, d_ff = 2048 and 8 attention heads. We also work with the Transformer big setting, where each hyper-parameter above is doubled. ... We update model parameters via Adafactor (Shazeer & Stern, 2018) with label smoothing of value 0.1 and a scheduled learning rate with 40K warmup steps. We apply dropout of 0.1 to residuals, feed-forward activations and attentions. We employ the post-norm Transformer by default; for some exceptional cases (often with deep models where training is unstable) we use the pre-norm one instead. Batch size is set to about 128K tokens. We train models for up to 1M steps on different tasks, except Web En-De, where 500K steps are used. We average 10 checkpoints for evaluation. ... Beam search is used for inference, with a beam size of 8 and length penalty of 0.5. (These settings are summarized in the configuration sketch below the table.) |
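
For scripting convenience, the split sizes quoted in the Dataset Splits row can be captured in a small data structure. The sketch below only restates the counts from Table 2 of the paper; the dictionary layout and field names are illustrative assumptions, not taken from the authors' code, and the OPUS-100 multilingual splits are omitted.

```python
# Hypothetical encoding of the bilingual split statistics from Table 2.
# Only the sentence counts and test-set sources come from the paper;
# the structure and names are illustrative.
DATASET_SPLITS = {
    "WMT14 En-De": {"train": 4_500_000,
                    "dev": [(3000, "WMT13")],
                    "test": [(3003, "WMT14")]},
    "WMT14 En-Fr": {"train": 41_000_000,
                    "dev": [(3000, "WMT13")],
                    "test": [(3003, "WMT14")]},
    "WMT19 En-Zh": {"train": 26_000_000,
                    "dev": [(3981, "WMT18")],
                    "test": [(1997, "WMT19, source-original"),
                             (2000, "WMT19, target-original")]},
}

if __name__ == "__main__":
    for name, splits in DATASET_SPLITS.items():
        test_total = sum(n for n, _ in splits["test"])
        print(f"{name}: {splits['train']:,} train | "
              f"{splits['dev'][0][0]} dev | {test_total} test sentences")
```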
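The hyper-parameters listed in the Experiment Setup row translate naturally into a configuration object. This is a minimal sketch assuming a generic Transformer training pipeline; the class name `TransformerConfig` and its field names are illustrative assumptions, and the values simply restate the quoted settings rather than the authors' actual configuration files.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransformerConfig:
    """Hyper-parameters as reported in the paper; field names are illustrative."""
    d_model: int = 512              # Transformer-base hidden size (d)
    d_ff: int = 2048                # feed-forward inner size
    num_heads: int = 8              # attention heads
    dropout: float = 0.1            # residuals, feed-forward activations, attentions
    label_smoothing: float = 0.1
    warmup_steps: int = 40_000      # Adafactor learning-rate warmup
    batch_tokens: int = 128_000     # roughly 128K tokens per batch
    train_steps: int = 1_000_000    # 500K for the Web En-De task
    norm: str = "post"              # pre-norm only for unstable deep models
    checkpoints_averaged: int = 10  # checkpoint averaging before evaluation
    beam_size: int = 8              # beam search at inference time
    length_penalty: float = 0.5


BASE = TransformerConfig()
# The "big" setting doubles the base hidden size, feed-forward width and head count.
BIG = replace(BASE, d_model=1024, d_ff=4096, num_heads=16)
```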