Contrastive Clustering to Mine Pseudo Parallel Data for Unsupervised Translation

Authors: Xuan-Phi Nguyen, Hongyu Gong, Yun Tang, Changhan Wang, Philipp Koehn, Shafiq Joty

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves the state of the art in the WMT'14 English-French, WMT'16 German-English and English-Romanian bilingual unsupervised translation tasks, with 40.2, 36.8, and 37.0 BLEU, respectively.
Researcher Affiliation | Collaboration | Meta AI; Nanyang Technological University; Johns Hopkins University
Pseudocode | Yes | Algorithm 1 Sinkhorn: Given a matrix $Z \in \mathbb{R}^{B \times K}$, which represents the after-exponential latent representations of a batch of samples, and the number of iterations n; return the Sinkhorn prototype output $Q \in \mathbb{R}^{B \times K}$.
Open Source Code | Yes | Code: https://github.com/nxphi47/fairseq/tree/swav_umt
Open Datasets | Yes | For the WMT'14 English-French (En-Fr), WMT'16 English-German (En-De) and WMT'16 English-Romanian (En-Ro) bilingual UMT tasks, we follow the established predecessors (Lample et al., 2018c; Conneau & Lample, 2019; Song et al., 2019; Nguyen et al., 2021) to use only the monolingual data from 2007-2017 WMT News Crawl datasets of the two languages for each task.
Dataset Splits | No | The paper mentions a 'validation set' for certain metrics (e.g., Global Accuracy) and 'held-out' data for visualizations, but it does not specify training/validation/test splits for the main UMT tasks, whether as percentages, sample counts, or references to predefined splits.
Hardware Specification | No | The paper does not give hardware details such as GPU/CPU models or the type of computing resources used for the experiments.
Software Dependencies | No | The paper mentions software such as the Moses 'multi-bleu.perl' script, 'sacrebleu', and a 'sentencepiece' tokenizer model, but it does not give version numbers for these or any other software dependencies.
Experiment Setup | Yes | We set minimum and maximum lengths of L_min = 5 and L_max = 300; a source/target length ratio µ of 1.5; a maximum overlap ratio γ_i = 0.35; and accept only the top ρ = 5% of mined pairs. The agreement BLEU threshold is β = 30.
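The Pseudocode row describes Algorithm 1 only by its inputs and outputs. As a rough illustration, the sketch below shows a Sinkhorn-Knopp style normalization of the kind used in SwAV-like clustering: it alternately rescales the prototype and sample dimensions of a positive B x K matrix. This is an assumption about the general technique, not a transcription of the paper's Algorithm 1; the function name `sinkhorn`, the default `n_iters=3`, and the final per-sample rescaling are illustrative choices.

```python
import numpy as np


def sinkhorn(Z: np.ndarray, n_iters: int = 3) -> np.ndarray:
    """Sinkhorn-Knopp style normalization (illustrative sketch, not the paper's code).

    Z: positive array of shape (B, K), e.g. exponentiated sample-to-prototype scores.
    Returns Q of shape (B, K) in which every row (sample) sums to 1.
    """
    Q = np.asarray(Z, dtype=np.float64).T.copy()  # work on the (K, B) transpose
    K, B = Q.shape
    Q /= Q.sum()  # normalize the total mass to 1

    for _ in range(n_iters):
        # Give every prototype (row of the transpose) a total mass of 1 / K.
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # Give every sample (column of the transpose) a total mass of 1 / B.
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= B

    Q *= B  # rescale so each sample's assignment sums to 1
    return Q.T


# Hypothetical usage with random scores for 8 samples and 4 prototypes.
rng = np.random.default_rng(0)
scores = np.exp(rng.normal(size=(8, 4)))
assignments = sinkhorn(scores)
```

Because the sample dimension is normalized last, each row of the result is a valid soft assignment over the K prototypes, while the prototype totals are only approximately balanced after a small number of iterations.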
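The filtering thresholds quoted in the Experiment Setup row can be read as a simple sequence of checks on each mined sentence pair. The sketch below is one possible reading, under several assumptions that the quoted text does not settle: lengths are counted in tokens, the length ratio is longer over shorter, overlap is the share of distinct tokens common to both sides, and the top 5% is taken by some per-pair mining score after the other filters. The names `passes_filters` and `select_top` are hypothetical, and the agreement BLEU threshold (β = 30) is not modeled.

```python
from typing import List, Sequence, Tuple

L_MIN, L_MAX = 5, 300   # minimum and maximum sentence length (assumed to be in tokens)
MAX_LEN_RATIO = 1.5     # source/target length ratio
MAX_OVERLAP = 0.35      # maximum overlap ratio
TOP_FRACTION = 0.05     # accept only the top 5% of mined pairs

Pair = Tuple[Sequence[str], Sequence[str], float]  # (source tokens, target tokens, score)


def passes_filters(src: Sequence[str], tgt: Sequence[str]) -> bool:
    """Apply the length, length-ratio, and overlap checks to one pair."""
    ls, lt = len(src), len(tgt)
    if not (L_MIN <= ls <= L_MAX and L_MIN <= lt <= L_MAX):
        return False
    # Assumption: the ratio is measured as longer side over shorter side.
    if max(ls, lt) / min(ls, lt) > MAX_LEN_RATIO:
        return False
    # Assumption: overlap is the fraction of distinct tokens shared by both sides.
    src_set, tgt_set = set(src), set(tgt)
    overlap = len(src_set & tgt_set) / min(len(src_set), len(tgt_set))
    return overlap <= MAX_OVERLAP


def select_top(pairs: List[Pair], fraction: float = TOP_FRACTION) -> List[Pair]:
    """Keep the highest-scoring fraction of the pairs that pass the filters."""
    kept = [p for p in pairs if passes_filters(p[0], p[1])]
    kept.sort(key=lambda p: p[2], reverse=True)
    return kept[: int(len(kept) * fraction)]
```

Whether the top-ρ cut is applied before or after the other filters is not stated in the quoted setup; the order here is chosen only to keep the example short.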