Cross-lingual Retrieval for Iterative Self-Supervised Training

Authors: Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks. Our code and pretrained models are publicly available." (Sections 1 and 5, Experiment Evaluation)
Researcher Affiliation | Industry | Chau Tran (Facebook AI, chau@fb.com); Yuqing Tang (Facebook AI, yuqtang@fb.com); Xian Li (Facebook AI, xianl@fb.com); Jiatao Gu (Facebook AI, jgu@fb.com)
Pseudocode | Yes | Algorithm 1 (Unsupervised Parallel Data Mining) and Algorithm 2 (CRISS training); a sketch of the mining step appears below the table.
Open Source Code | Yes | "Our code and pretrained models are publicly available." https://github.com/pytorch/fairseq/blob/master/examples/criss
Open Datasets | Yes | "We pretrained an mBART model with Common Crawl dataset constrained to the 25 languages as in [27]... We use the TED58 dataset which contains multi-way translations of TED talks in 58 languages [34]... We use the Tatoeba dataset [6] to evaluate the cross-lingual alignment quality of CRISS model following the evaluation procedure specified in the XTREME benchmark [18]... For English-French we use WMT'14, for English-German and English-Romanian we use WMT'16 test data, and for English-Nepali and English-Sinhala we use the Flores test set [16]." (The Tatoeba retrieval metric is sketched below the table.)
Dataset Splits | Yes | "In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000." (A sketch of this tuning loop appears below the table.)
Hardware Specification | No | The paper does not specify any particular GPU models, CPU models, or other hardware used for the experiments.
Software Dependencies | No | The paper mentions the Fairseq library [30] and a mosesdecoder script [4], but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We set K = 5 for the KNN neighborhood retrieval for the margin score functions (Equation 2). In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000... With the mined 180 directions parallel data, we then train the multilingual transformer model for a maximum of 20,000 steps using label-smoothed cross-entropy loss as described in Algorithm 2. We sweep for the best maximum learning rate using validation BLEUs... For all directions, we use 0.3 dropout rate, 0.2 label smoothing, 2500 learning rate warm-up steps, 3e-5 maximum learning rate. We use a maximum of 40K training steps, and final models are selected based on best valid loss." (A hedged fairseq-train reconstruction appears below the table.)
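
The Pseudocode row points to Algorithm 1 (Unsupervised Parallel Data Mining). Below is a minimal NumPy sketch of the margin-score mining step, assuming the ratio form of the Artetxe-and-Schwenk margin function that Equation 2 builds on; `margin_scores`, `mine_pairs`, and `threshold` are illustrative names, not from the released code, and the real pipeline embeds sentences with the mBART encoder and scales nearest-neighbor search with an index library such as FAISS.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Ratio-margin score for every source/target pair:
    score(x, y) = cos(x, y) / (avg cos of x to its k NNs / 2 +
                               avg cos of y to its k NNs / 2).
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                     # cosine similarities
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # avg top-k per source
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # avg top-k per target
    return sim / (src_knn[:, None] / 2 + tgt_knn[None, :] / 2)

def mine_pairs(src_emb, tgt_emb, threshold, k=5):
    """Keep mutual-best pairs whose margin score clears the threshold."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_tgt = scores.argmax(axis=1)   # best target index for each source
    best_src = scores.argmax(axis=0)   # best source index for each target
    return [(i, j, float(scores[i, j]))
            for i, j in enumerate(best_tgt)
            if best_src[j] == i and scores[i, j] >= threshold]
```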
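The Tatoeba evaluation cited in the Open Datasets row reduces to nearest-neighbor retrieval: each non-English sentence is matched to its closest English candidate by cosine similarity of sentence embeddings, and accuracy is the fraction matched to the gold (same-index) translation. A toy version follows, assuming embeddings are precomputed and ignoring the second retrieval direction that XTREME also reports.

```python
import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target is the gold one."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)   # best English match per sentence
    return float((nearest == np.arange(len(src_emb))).mean())
```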
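The Dataset Splits row says the margin threshold is tuned each iteration against BLEU on a 2,000-sentence validation sample. The paper does not spell out the tuning loop, so the sketch below is one plausible reading: candidate thresholds are swept, hypotheses are produced under each threshold, and the threshold with the best corpus BLEU wins. `translate_fn` is a hypothetical helper standing in for whatever produces hypotheses at a given threshold.

```python
import sacrebleu

def tune_threshold(val_src, val_refs, translate_fn, candidates):
    """Pick the margin-score threshold that maximizes validation BLEU."""
    best_t, best_bleu = None, float("-inf")
    for t in candidates:
        # translate_fn is hypothetical: hypotheses for this threshold setting.
        hyps = [translate_fn(s, threshold=t) for s in val_src]
        bleu = sacrebleu.corpus_bleu(hyps, [val_refs]).score
        if bleu > best_bleu:
            best_t, best_bleu = t, bleu
    return best_t, best_bleu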
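The hyperparameters in the Experiment Setup row map naturally onto fairseq-train options. The snippet below is a hedged reconstruction, not a command given in the paper: the flag names are real fairseq options, but the architecture, task, and scheduler choices are assumptions following the mBART recipe that CRISS builds on, and the data path is a placeholder.

```python
import subprocess

# Requires a fairseq-preprocessed data directory ("data-bin" is a placeholder).
args = [
    "fairseq-train", "data-bin",
    "--arch", "mbart_large",                       # assumed: mBART backbone
    "--task", "translation_from_pretrained_bart",  # assumed: mBART finetuning task
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.2",        # from the paper
    "--dropout", "0.3",                # from the paper
    "--lr", "3e-5",                    # maximum learning rate (paper sweeps this)
    "--lr-scheduler", "inverse_sqrt",  # scheduler not stated in the paper
    "--warmup-updates", "2500",        # from the paper
    "--max-update", "40000",           # 40K-step budget from the paper
]
subprocess.run(args, check=True)
```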