Self-supervised and Supervised Joint Training for Resource-rich Machine Translation
Authors: Yong Cheng, Wei Wang, Lu Jiang, Wolfgang Macherey
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two resource-rich translation benchmarks, WMT'14 English-German and WMT'14 English-French, demonstrate that our approach achieves substantial improvements over several strong baseline methods and obtains a new state of the art of 46.19 BLEU on English-French when incorporating back translation. |
| Researcher Affiliation | Collaboration | (1) Google Research, Google LLC, USA; (2) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania. Work done while at Google Research. Correspondence to: Yong Cheng <chengyong@google.com>. |
| Pseudocode | Yes | Algorithm 1 delineates the procedure to compute the final loss L(θ). Specifically, each time, we sample a monolingual sentence for each parallel sentence to circumvent the expensive enumeration in Eq. (8). To speed up the training, we group sentences offline by length in Step 3 (cf. batching data in the supplementary document). A hedged sketch of this sampling-and-batching loop is given below the table. |
| Open Source Code | No | The paper does not provide explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | We evaluate our approach on two representative, resource-rich translation datasets, WMT'14 English-German and WMT'14 English-French, across four translation directions: English→German (En→De), German→English (De→En), English→French (En→Fr), and French→English (Fr→En). ... The English, German and French monolingual corpora in our experiments come from the WMT'14 translation tasks. |
| Dataset Splits | Yes | For English-German, the validation set is newstest2013 and the test set is newstest2014. ... The vocabulary for the English-French dataset is also jointly split into 44K sub-word units. The concatenation of newstest2012 and newstest2013 is used as the validation set while newstest2014 is the test set. |
| Hardware Specification | Yes | We carry out our experiments on a cluster of 128 P100 GPUs and update gradients synchronously. |
| Software Dependencies | No | The paper mentions using the 'Lingvo toolkit' and 'Adam' optimizer but does not specify their version numbers or the versions of other software dependencies. |
| Experiment Setup | Yes | The Transformer models follow the original network settings (Vaswani et al., 2017). In particular, the layer normalization is applied after each residual connection rather than before each sub-layer. The dropout ratios are set to 0.1 for all Transformer models except for the Transformer-big model on English-German, where 0.3 is used. We search the hyperparameters using the Transformer-base model on English-German. In our method, the shuffling ratio p1 is set to 0.50, while 0.25 is used for English-French in Table 6. p2 is sampled from a Beta distribution Beta(2, 6). The dropout ratio of A is 0.2 for all the models. For decoding, we use a beam size of 4 and a length penalty of 0.6 for English-German, and a beam size of 5 and a length penalty of 1.0 for English-French. We carry out our experiments on a cluster of 128 P100 GPUs and update gradients synchronously. The model is optimized with Adam (Kingma & Ba, 2014) following the same learning rate schedule used in (Vaswani et al., 2017), except for the warmup steps, which are set to 4000 for both Transformer-base and Transformer-big models. A sketch of this learning-rate schedule and the ratio sampling is given below the table. |
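
The Pseudocode row above describes Algorithm 1: for every parallel sentence a single monolingual sentence is sampled to approximate the expensive enumeration in Eq. (8), and sentences are grouped offline by length for batching. The sketch below illustrates only that sampling-and-batching loop under stated assumptions; `bucket_by_length`, `supervised_loss`, and `self_supervised_loss` are our own hypothetical stand-ins, not the authors' released code.

```python
import random
from collections import defaultdict


def bucket_by_length(sentences, bucket_width=8):
    """Offline length-grouping (Step 3 of Algorithm 1 as quoted above):
    sentences of similar length share a bucket so batches need less padding."""
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s) // bucket_width].append(s)
    return list(buckets.values())


def joint_loss(model, parallel_batch, monolingual_pool):
    """One training step of the joint objective, sketched.

    For each parallel pair we draw a single monolingual sentence instead of
    enumerating the whole monolingual corpus (the Eq. (8) shortcut quoted
    above), then sum a supervised translation loss and a self-supervised
    loss.  `model.supervised_loss` and `model.self_supervised_loss` are
    hypothetical helpers standing in for the paper's actual loss terms.
    """
    total = 0.0
    for src, tgt in parallel_batch:
        mono = random.choice(monolingual_pool)     # one sample per parallel pair
        total += model.supervised_loss(src, tgt)   # translation cross-entropy
        total += model.self_supervised_loss(mono)  # self-supervised term on the sampled sentence
    return total / len(parallel_batch)
```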
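
The Experiment Setup row quotes the learning-rate schedule of Vaswani et al. (2017) with 4,000 warmup steps, a fixed shuffling ratio p1, and p2 drawn from Beta(2, 6). A minimal sketch of those two pieces follows; d_model = 512 for Transformer-base is an assumption carried over from the original Transformer settings, and the function names are ours.

```python
import numpy as np


def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule from Vaswani et al. (2017):
    lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}),
    with warmup_steps = 4000 as quoted in the setup above."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def sample_ratios(p1=0.5, rng=None):
    """Return the fixed shuffling ratio p1 (0.50 for English-German, 0.25 for
    English-French per the quoted setup) and a p2 drawn from Beta(2, 6).
    The helper name is hypothetical."""
    rng = rng or np.random.default_rng()
    return p1, rng.beta(2.0, 6.0)
```

At step 4000 with d_model = 512, `transformer_lr` peaks at roughly 7.0e-4 and then decays proportionally to step^-0.5, matching the schedule the paper reuses.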