Cross-model Back-translated Distillation for Unsupervised Machine Translation

Authors: Xuan-Phi Nguyen, Shafiq Joty, Thanh-Tung Nguyen, Kui Wu, Ai Ti Aw

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, CBD achieves the state of the art in the WMT'14 English-French, WMT'16 English-German and English-Romanian bilingual unsupervised translation tasks, with BLEU scores of 38.2, 30.1, and 36.3, respectively.
Researcher Affiliation | Collaboration | 1 Nanyang Technological University; 2 Institute for Infocomm Research (I2R), A*STAR; 3 Salesforce Research Asia. Correspondence to: Xuan-Phi Nguyen <nguyenxu002@e.ntu.edu.sg>.
Pseudocode | Yes | Algorithm 1 describes the overall CBD training process, where the ordered pair (θα, θβ) is alternated between (θ1, θ2) and (θ2, θ1). (A minimal sketch of this alternation appears after the table.)
Open Source Code | Yes | Code: https://github.com/nxphi47/multiagent_crosstranslate.
Open Datasets | Yes | Specifically, we use all of the monolingual data from 2007-2017 WMT News Crawl datasets, which yield 190M, 78M, 309M and 3M sentences for English (En), French (Fr), German (De) and Romanian (Ro), respectively. ... The IWSLT'13 En-Fr dataset contains 200K sentences for each language. ... The IWSLT'14 En-De dataset contains 160K sentences for each language.
Dataset Splits | Yes | We use the IWSLT15.TED.tst2012 set for validation and the IWSLT15.TED.tst2013 set for testing. ... We split it into 95% for training and 5% for validation, and we use IWSLT14.TED.{dev2010, dev2012, tst2010, tst2011, tst2012} for testing.
Hardware Specification | Yes | We train the model with 2K tokens per batch on an 8-GPU system. ... We use a 4-GPU system to train the models. ... trained using only 1 GPU.
Software Dependencies | No | The paper mentions software such as the Moses multi-bleu.perl script, XLM, MASS, Transformer, KenLM, Byte-Pair Encoding, and fastText, but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | We train the model with 2K tokens per batch on an 8-GPU system. ... Transformers with 6 layers and 1024 model dimensions. ... We follow Lample et al. (2018c) to train the UMT agents with a parameter-shared Transformer (Vaswani et al., 2017) that has 6 layers and 512 dimensions and a batch size of 32 sentences.
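
The Pseudocode row quotes the alternation of the ordered agent pair in Algorithm 1. The following is a minimal, hypothetical Python sketch of just that alternation pattern, not the authors' implementation: `cbd_step` and all argument names are placeholders, and the actual per-step data flow (how the two agents' translations are used to distill the final model) is specified in Algorithm 1 of the paper and the released code.

```python
# Hypothetical sketch of the alternation described for Algorithm 1 (CBD).
# `cbd_step` is a placeholder for one training update of the distilled model
# using the current ordered agent pair; its internals follow Algorithm 1
# in the paper and are not reproduced here.

def train_cbd(theta_1, theta_2, theta_distilled, monolingual_batches, cbd_step):
    """Run CBD-style training, swapping the (alpha, beta) agent roles each step."""
    for step, batch in enumerate(monolingual_batches):
        # Alternate the ordered pair between (theta_1, theta_2) and (theta_2, theta_1).
        theta_alpha, theta_beta = (theta_1, theta_2) if step % 2 == 0 else (theta_2, theta_1)
        # One update of the distilled model with the current agent ordering.
        cbd_step(theta_distilled, theta_alpha, theta_beta, batch)
    return theta_distilled
```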
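
For convenience, the hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into one place. This is a hypothetical configuration sketch only: the class and field names are ours, "2K tokens" is written as 2000 by assumption, and anything not quoted above (optimizer, learning rate, etc.) is deliberately omitted.

```python
# Hypothetical config container; field names are ours, values come from the
# quotes in the Hardware Specification and Experiment Setup rows above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class UMTSetup:
    num_layers: int                            # Transformer layers
    model_dim: int                             # model (hidden) dimension
    tokens_per_batch: Optional[int] = None     # batch size in tokens, where quoted
    sentences_per_batch: Optional[int] = None  # batch size in sentences, where quoted
    num_gpus: Optional[int] = None


# WMT setting: 6-layer, 1024-dimensional Transformers, 2K tokens per batch, 8 GPUs.
wmt_setup = UMTSetup(num_layers=6, model_dim=1024, tokens_per_batch=2000, num_gpus=8)

# IWSLT setting (Lample et al., 2018c): parameter-shared Transformer with
# 6 layers, 512 dimensions, and a batch size of 32 sentences.
iwslt_setup = UMTSetup(num_layers=6, model_dim=512, sentences_per_batch=32)
```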