Cross-model Back-translated Distillation for Unsupervised Machine Translation
Authors: Xuan-Phi Nguyen, Shafiq Joty, Thanh-Tung Nguyen, Kui Wu, Ai Ti Aw
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, CBD achieves the state of the art in the WMT 14 English-French, WMT 16 English-German and English-Romanian bilingual unsupervised translation tasks, with BLEU scores of 38.2, 30.1, and 36.3, respectively. |
| Researcher Affiliation | Collaboration | 1 Nanyang Technological University; 2 Institute for Infocomm Research (I2R), A*STAR; 3 Salesforce Research Asia. Correspondence to: Xuan-Phi Nguyen <nguyenxu002@e.ntu.edu.sg>. |
| Pseudocode | Yes | Algorithm 1 describes the overall CBD training process, where the ordered pair (θα, θβ) is alternated between (θ1, θ2) and (θ2, θ1). (An illustrative sketch of this training loop follows the table.) |
| Open Source Code | Yes | Code: https://github.com/nxphi47/multiagent_crosstranslate. |
| Open Datasets | Yes | Specifically, we use all of the monolingual data from the 2007-2017 WMT News Crawl datasets, which yield 190M, 78M, 309M and 3M sentences for English (En), French (Fr), German (De) and Romanian (Ro), respectively. ... The IWSLT 13 En-Fr dataset contains 200K sentences for each language. ... The IWSLT 14 En-De dataset contains 160K sentences for each language. |
| Dataset Splits | Yes | We use the IWSLT15.TED.tst2012 set for validation and the IWSLT15.TED.tst2013 set for testing. ... We split it into 95% for training and 5% for validation, and we use IWSLT14.TED.{dev2010, dev2012, tst2010, tst2011, tst2012} for testing. |
| Hardware Specification | Yes | We train the model with 2K tokens per batch on an 8-GPU system. ... We use a 4-GPU system to train the models. ... trained using only 1 GPU. |
| Software Dependencies | No | The paper mentions software like 'Moses multi-bleu.perl script', 'XLM', 'MASS', 'Transformer', 'KenLM', 'Byte-Pair Encoding', and 'fastText' but does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | We train the model with 2K tokens per batch on an 8-GPU system. ... Transformers with 6 layers and 1024 model dimensions. ... We follow Lample et al. (2018c) to train the UMT agents with a parameter-shared Transformer (Vaswani et al., 2017) that has 6 layers and 512 dimensions and a batch size of 32 sentences. (A compact summary of these values follows the table, after the training-loop sketch.) |
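
The Pseudocode row above points to Algorithm 1 of the paper. The following is a minimal Python sketch of how that alternation could be organized. It is an illustration only: the helpers `translate` and `supervised_step`, the data layout, and the assumption that the cross-translated pair (x̂, ŷ) serves as the pseudo-parallel training example are ours, not the authors' released implementation (see the linked repository for the real code).

```python
from typing import Callable, Dict, Iterable, List

# Hedged sketch of the CBD alternation described in Algorithm 1 of the paper.
# `translate` and `supervised_step` are hypothetical callables supplied by the caller.

def cbd_epoch(
    theta,                      # distilled model being trained
    theta1, theta2,             # two pretrained UMT agents
    mono_data: Dict[str, Iterable[List[str]]],  # monolingual batches keyed by language
    translate: Callable,        # translate(agent, batch, src, tgt) -> translated batch
    supervised_step: Callable,  # supervised_step(theta, src_batch, tgt_batch, src, tgt)
    lang_a: str = "en",
    lang_b: str = "fr",
):
    """One pass of cross-model back-translated distillation over monolingual data."""
    for src, tgt in [(lang_a, lang_b), (lang_b, lang_a)]:
        for batch in mono_data[src]:
            # Alternate the ordered pair (θα, θβ) between (θ1, θ2) and (θ2, θ1).
            for theta_alpha, theta_beta in [(theta1, theta2), (theta2, theta1)]:
                # Agent α translates the monolingual batch into the other language.
                y_hat = translate(theta_alpha, batch, src, tgt)
                # Agent β back-translates α's output into the original language.
                x_hat = translate(theta_beta, y_hat, tgt, src)
                # Update the distilled model on the resulting cross-translated pair
                # (assumed here to be x_hat -> y_hat; see the paper for the exact losses).
                supervised_step(theta, x_hat, y_hat, src, tgt)
    return theta
```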
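
For quick reference, the hyperparameters quoted in the Experiment Setup and Hardware rows can be summarized as plain Python constants. The field names and the pairing of the single-GPU note with the IWSLT agents are assumptions; values not quoted above (optimizer, learning rate, vocabulary, etc.) are intentionally left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the excerpts above; field names are illustrative."""
    layers: int
    model_dim: int
    batch: str
    gpus: int

# WMT-scale CBD training (XLM/MASS-sized Transformer).
wmt_setup = ReportedSetup(layers=6, model_dim=1024, batch="2K tokens", gpus=8)

# IWSLT-scale UMT agents (parameter-shared Transformer, Lample et al., 2018c);
# the 1-GPU figure comes from the Hardware row and its pairing here is assumed.
iwslt_setup = ReportedSetup(layers=6, model_dim=512, batch="32 sentences", gpus=1)
```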