Cross-lingual Retrieval for Iterative Self-Supervised Training

Authors: Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks. Our code and pretrained models are publicly available." (Sections 1 and 5, Experiment Evaluation)
Researcher Affiliation | Industry | Chau Tran (Facebook AI, chau@fb.com); Yuqing Tang (Facebook AI, yuqtang@fb.com); Xian Li (Facebook AI, xianl@fb.com); Jiatao Gu (Facebook AI, jgu@fb.com)
Pseudocode | Yes | Algorithm 1 (Unsupervised Parallel Data Mining) and Algorithm 2 (CRISS training); a sketch of the mining step appears below the table.
Open Source Code | Yes | "Our code and pretrained models are publicly available." https://github.com/pytorch/fairseq/blob/master/examples/criss
Open Datasets | Yes | "We pretrained an mBART model with Common Crawl dataset constrained to the 25 languages as in [27]... We use the TED58 dataset which contains multi-way translations of TED talks in 58 languages [34]... We use the Tatoeba dataset [6] to evaluate the cross-lingual alignment quality of CRISS model following the evaluation procedure specified in the XTREME benchmark [18]... For English-French we use WMT'14, for English-German and English-Romanian we use WMT'16 test data, and for English-Nepali and English-Sinhala we use the Flores test set [16]." (The Tatoeba retrieval metric is sketched below the table.)
Dataset Splits | Yes | "In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000." (A sketch of this tuning loop appears below the table.)
Hardware Specification | No | The paper does not specify any particular GPU models, CPU models, or other hardware used for the experiments.
Software Dependencies | No | The paper mentions the Fairseq library [30] and a mosesdecoder script [4], but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We set K = 5 for the KNN neighborhood retrieval for the margin score functions (Equation 2). In each iteration, we tune the margin score threshold based on validation BLEU on a sampled validation set of size 2000... With the mined 180 directions parallel data, we then train the multilingual transformer model for a maximum of 20,000 steps using label-smoothed cross-entropy loss as described in Algorithm 2. We sweep for the best maximum learning rate using validation BLEUs... For all directions, we use 0.3 dropout rate, 0.2 label smoothing, 2500 learning rate warm-up steps, 3e-5 maximum learning rate. We use a maximum of 40K training steps, and final models are selected based on best valid loss." (A hedged fairseq-train reconstruction appears below the table.)
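
The Pseudocode row points to Algorithm 1 (Unsupervised Parallel Data Mining). Below is a minimal NumPy sketch of the margin-score mining step, assuming the ratio form of the Artetxe-and-Schwenk margin function that Equation 2 builds on; `margin_scores`, `mine_pairs`, and `threshold` are illustrative names, not from the released code, and the real pipeline embeds sentences with the mBART encoder and scales nearest-neighbor search with an index library such as FAISS.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Ratio-margin score for every source/target pair:
    score(x, y) = cos(x, y) / (avg cos of x to its k NNs / 2 +
                               avg cos of y to its k NNs / 2).
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                     # cosine similarities
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # avg top-k per source
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # avg top-k per target
    return sim / (src_knn[:, None] / 2 + tgt_knn[None, :] / 2)

def mine_pairs(src_emb, tgt_emb, threshold, k=5):
    """Keep mutual-best pairs whose margin score clears the threshold."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_tgt = scores.argmax(axis=1)   # best target index for each source
    best_src = scores.argmax(axis=0)   # best source index for each target
    return [(i, j, float(scores[i, j]))
            for i, j in enumerate(best_tgt)
            if best_src[j] == i and scores[i, j] >= threshold]
```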
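The Tatoeba evaluation cited in the Open Datasets row reduces to nearest-neighbor retrieval: each non-English sentence is matched to its closest English candidate by cosine similarity of sentence embeddings, and accuracy is the fraction matched to the gold (same-index) translation. A toy version follows, assuming embeddings are precomputed and ignoring the second retrieval direction that XTREME also reports.

```python
import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target is the gold one."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)   # best English match per sentence
    return float((nearest == np.arange(len(src_emb))).mean())
```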
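The Dataset Splits row says the margin threshold is tuned each iteration against BLEU on a 2,000-sentence validation sample. The paper does not spell out the tuning loop, so the sketch below is one plausible reading: candidate thresholds are swept, hypotheses are produced under each threshold, and the threshold with the best corpus BLEU wins. `translate_fn` is a hypothetical helper standing in for whatever produces hypotheses at a given threshold.

```python
import sacrebleu

def tune_threshold(val_src, val_refs, translate_fn, candidates):
    """Pick the margin-score threshold that maximizes validation BLEU."""
    best_t, best_bleu = None, float("-inf")
    for t in candidates:
        # translate_fn is hypothetical: hypotheses for this threshold setting.
        hyps = [translate_fn(s, threshold=t) for s in val_src]
        bleu = sacrebleu.corpus_bleu(hyps, [val_refs]).score
        if bleu > best_bleu:
            best_t, best_bleu = t, bleu
    return best_t, best_bleu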
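The hyperparameters in the Experiment Setup row map naturally onto fairseq-train options. The snippet below is a hedged reconstruction, not a command given in the paper: the flag names are real fairseq options, but the architecture, task, and scheduler choices are assumptions following the mBART recipe that CRISS builds on, and the data path is a placeholder.

```python
import subprocess

# Requires a fairseq-preprocessed data directory ("data-bin" is a placeholder).
args = [
    "fairseq-train", "data-bin",
    "--arch", "mbart_large",                       # assumed: mBART backbone
    "--task", "translation_from_pretrained_bart",  # assumed: mBART finetuning task
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.2",        # from the paper
    "--dropout", "0.3",                # from the paper
    "--lr", "3e-5",                    # maximum learning rate (paper sweeps this)
    "--lr-scheduler", "inverse_sqrt",  # scheduler not stated in the paper
    "--warmup-updates", "2500",        # from the paper
    "--max-update", "40000",           # 40K-step budget from the paper
]
subprocess.run(args, check=True)
```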