Cross-lingual Retrieval for Iterative Self-Supervised Training
Authors: Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks. Our code and pretrained models are publicly available. (Sections 1 and 5, Experiment Evaluation) |
| Researcher Affiliation | Industry | Chau Tran Facebook AI chau@fb.com Yuqing Tang Facebook AI yuqtang@fb.com Xian Li Facebook AI xianl@fb.com Jiatao Gu Facebook AI jgu@fb.com |
| Pseudocode | Yes | Algorithm 1 Unsupervised Parallel Data Mining and Algorithm 2 CRISS training |
| Open Source Code | Yes | Our code and pretrained models are publicly available. https://github.com/pytorch/fairseq/blob/master/examples/criss |
| Open Datasets | Yes | We pretrained an mBART model with the Common Crawl dataset constrained to the 25 languages as in [27]... We use the TED58 dataset which contains multi-way translations of TED talks in 58 languages [34]... We use the Tatoeba dataset [6] to evaluate the cross-lingual alignment quality of the CRISS model following the evaluation procedure specified in the XTREME benchmark [18]... For English-French we use WMT14, for English-German and English-Romanian we use WMT16 test data, and for English-Nepali and English-Sinhala we use the Flores test set [16]. |
| Dataset Splits | Yes | In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000. |
| Hardware Specification | No | The paper does not specify the GPU models, CPU models, or other hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using the Fairseq library [30] and a mosesdecoder script [4], but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We set K = 5 for the KNN neighborhood retrieval for the margin score functions (Equation 2). In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000... With the mined parallel data for 180 directions, we then train the multilingual transformer model for a maximum of 20,000 steps using label-smoothed cross-entropy loss as described in Algorithm 2. We sweep for the best maximum learning rate using validation BLEUs... For all directions, we use 0.3 dropout rate, 0.2 label smoothing, 2500 learning rate warm-up steps, and a 3e-5 maximum learning rate. We use a maximum of 40K training steps, and final models are selected based on best validation loss. |
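The Experiment Setup row references margin-score kNN retrieval with K = 5 (the paper's Equation 2). The ratio-margin scoring commonly used for this kind of parallel-sentence mining can be sketched in NumPy as below; this is an illustrative brute-force version under that assumption, not the paper's fairseq implementation, and `margin_scores` is a hypothetical helper name.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=5):
    """Ratio-margin scores between two sets of sentence embeddings.

    score(x, y) = cos(x, y) / mean of the k-NN cosines of x and y,
    a sketch of the margin function family the paper's Equation 2 refers to
    (here in brute-force form; real mining pipelines use an ANN index).
    """
    # L2-normalize so a dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = src @ tgt.T                                     # pairwise cosines

    # mean cosine of each sentence's k nearest neighbors on the other side
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # per source row
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # per target column

    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return cos / denom                                    # margin-score matrix
```

In the mining loop described by the paper, pairs whose margin score exceeds a tuned threshold (selected per iteration via validation BLEU) would be kept as pseudo-parallel training data.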