The Unreasonable Effectiveness of Few-shot Learning for Machine Translation
Authors: Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, Orhan Firat
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate these models on the WMT 21 English-German and English-Chinese news translation tasks and show that they outperform commercial baselines and are competitive with WMT 21 submissions, which themselves rely on many of the aforementioned techniques. We then verify that our approach works in low-resource scenarios by performing a similar study on the WMT 21 English-Icelandic language pair. |
| Researcher Affiliation | Industry | ¹Google DeepMind, ²Google Translate. Correspondence to: Xavier Garcia <xgarcia@google.com>, Orhan Firat <orhanf@google.com>. |
| Pseudocode | No | The paper describes steps and templates in text but does not include structured pseudocode or formally labeled algorithm blocks. |
| Open Source Code | No | The paper mentions using open-source tools like JAX, T5X, FLAX, and BLEURT, but does not provide a link to or explicitly state the release of their own source code for the methodology described. |
| Open Datasets | Yes | Our training data consists of a collection of language-specific corpora. For English, we use a similar mix of filtered web pages, Wikipedia, and books as done in Chowdhery et al. (2022). For every other language, we restricted ourselves to only high-quality webpages, using similar filters as the English data. We evaluate these models on the WMT 21 English-German and English-Chinese news translation tasks. For this analysis, we use the FRMT dataset (Riley et al., 2022). We focus on the English-German language pair of the IWSLT 22 Special Task on Formality Control for Spoken Language Translation (Anastasopoulos et al., 2022). Icelandic... is one of the languages available in WMT 21. We evaluate both the English-Icelandic and Icelandic-English directions on the Flores (Goyal et al., 2022) devtest sets for Icelandic. |
| Dataset Splits | Yes | To condition the model to perform translation, we use the relevant development set for each language pair as a pool of demonstrations. All our results draw demonstrations from the same development sets as in Vilar et al. (2022), which they refer to as WMT-dev. We evaluate both the English-Icelandic and Icelandic-English directions on the Flores (Goyal et al., 2022) devtest sets for Icelandic. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU models, or TPU versions) used for running experiments is explicitly mentioned. |
| Software Dependencies | Yes | All the experiments in this work were conducted using JAX (Bradbury et al., 2021), using the T5X framework (Roberts et al., 2022) and FLAX (Heek et al., 2020). We use a Sentencepiece (Kudo & Richardson, 2018) model... We use the Adafactor optimizer (Shazeer & Stern, 2018)... We use the learnt metric BLEURT (Sellam et al., 2020) as our main metric to assess quality. We follow the recommendations from the publicly available GitHub page and use the BLEURT-20 checkpoint. (A scoring sketch using this checkpoint appears after the table.) |
| Experiment Setup | Yes | We use the exact hyperparameter configuration of their 8-billion-parameter model for our main experiments. In particular, we use 32 Transformer layers, with 16 heads, a hidden dimension of 4096, and multi-query attention. The feed-forward size is 16384 and the attention head size is 256. We use 128,000 Sentencepieces. In this work, we use a variant of the UL2 objective (Tay et al., 2022)... We use 2 (instead of 6) separate span corruption instances with (noise density, mean noise span length) given by (0.15, 3) and (0.5, 32) respectively. We mix these objectives randomly, sampling prefix language modeling 20% of the time, causal language modeling 60% of the time, and the remaining span corruption instances 20% of the time. We use a maximum sequence length of 2048, with a batch size of 1024. We use the Adafactor optimizer (Shazeer & Stern, 2018), without the factorizing option. We use a cosine learning rate decay schedule (Hoffmann et al., 2022), starting at 0.01 and ending at 0.001 at the end of training. This results in 98,000 steps for the English-Chinese model, 135,000 steps for the English-German model, and 166,000 steps for the trilingual model. For low-resource languages, we keep the learning rate constant at 0.001. (A configuration sketch appears after the table.) |
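
To make the evaluation dependency in the Software Dependencies row concrete, here is a minimal sketch of scoring candidate translations with the BLEURT-20 checkpoint. It assumes the google-research/bleurt package is installed and the checkpoint has been downloaded locally; the checkpoint path and the example sentences are placeholders, not artifacts from the paper.

```python
# Minimal sketch: scoring candidate translations with BLEURT-20.
# Assumes `pip install git+https://github.com/google-research/bleurt.git`
# and that the BLEURT-20 checkpoint has been unpacked locally.
from bleurt import score

CHECKPOINT = "BLEURT-20"  # placeholder path to the unpacked checkpoint directory

references = ["Der schnelle braune Fuchs springt über den faulen Hund."]
candidates = ["Der schnelle braune Fuchs springt über den trägen Hund."]

scorer = score.BleurtScorer(CHECKPOINT)
scores = scorer.score(references=references, candidates=candidates)  # one float per pair
print(scores)
```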
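
The sketch below restates the Experiment Setup row as code: the reported model dimensions, the cosine learning-rate decay from 0.01 to 0.001, and the sampling weights of the UL2-style objective mixture. It is illustrative only; the names are not the authors' T5X/gin configuration, the cosine formula is the standard schedule assumed to match the cited recipe, and the equal split of the 20% span-corruption share between its two instances is an assumption, since the paper does not state it.

```python
import math
import random

# Reported model and training hyperparameters (field names are illustrative).
MODEL_CONFIG = dict(
    num_layers=32,        # Transformer layers
    num_heads=16,         # attention heads (multi-query attention)
    d_model=4096,         # hidden dimension
    d_ff=16384,           # feed-forward size
    head_dim=256,         # attention head size
    vocab_size=128_000,   # Sentencepiece vocabulary
    max_seq_len=2048,
    batch_size=1024,
)

def cosine_lr(step: int, total_steps: int,
              lr_start: float = 0.01, lr_end: float = 0.001) -> float:
    """Standard cosine decay from lr_start to lr_end over total_steps (assumed form)."""
    progress = min(step / total_steps, 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# e.g. the trilingual model trains for 166,000 steps: 0.01 at step 0, 0.001 at the end.
print(cosine_lr(0, 166_000), cosine_lr(166_000, 166_000))

# UL2-style objective mixture: prefix LM 20%, causal LM 60%, and the two
# span-corruption instances share the remaining 20% (equal split assumed).
OBJECTIVES = {
    "prefix_lm": 0.20,
    "causal_lm": 0.60,
    "span_corruption_density_0.15_len_3": 0.10,
    "span_corruption_density_0.5_len_32": 0.10,
}

def sample_objective(rng: random.Random) -> str:
    """Pick the training objective for one example according to the mixture weights."""
    return rng.choices(list(OBJECTIVES), weights=list(OBJECTIVES.values()))[0]
```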