Contrastive Clustering to Mine Pseudo Parallel Data for Unsupervised Translation
Authors: Xuan-Phi Nguyen, Hongyu Gong, Yun Tang, Changhan Wang, Philipp Koehn, Shafiq Joty
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves the state of the art in the WMT'14 English-French, WMT'16 German-English and English-Romanian bilingual unsupervised translation tasks, with 40.2, 36.8, and 37.0 BLEU, respectively. |
| Researcher Affiliation | Collaboration | Meta AI; Nanyang Technological University; Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1 (Sinkhorn): Given matrix $Z \in \mathbb{R}^{B \times K}$, which represents the after-exponential latent representations of batches of samples, and $n$, the number of iterations; return the Sinkhorn prototype output $Q \in \mathbb{R}^{B \times K}$. |
| Open Source Code | Yes | Code: https://github.com/nxphi47/fairseq/tree/swav_umt |
| Open Datasets | Yes | For the WMT'14 English-French (En-Fr), WMT'16 English-German (En-De) and WMT'16 English-Romanian (En-Ro) bilingual UMT tasks, we follow the established predecessors (Lample et al., 2018c; Conneau & Lample, 2019; Song et al., 2019; Nguyen et al., 2021) to use only the monolingual data from 2007-2017 WMT News Crawl datasets of the two languages for each task. |
| Dataset Splits | No | The paper mentions using a 'validation set' for certain metrics (e.g., Global Accuracy) and 'held-out' data for visualizations, but does not specify the full training/validation/test dataset splits with explicit percentages, sample counts, or references to predefined splits for the main UMT tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or types of computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions software like 'Moses multi-bleu.perl script', 'sacrebleu', and 'sentencepiece tokenizer model', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set minimum and maximum lengths of Lmin = 5 and Lmax = 300; source/target length ratio µ of 1.5; maximum overlap ratio γi = 0.35 and accept only the top ρ = 5% of mined pairs. The agreement BLEU threshold is β = 30. |
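
The Pseudocode row describes Algorithm 1 only by its inputs and outputs. As a rough illustration, the sketch below implements a generic Sinkhorn-Knopp normalization of the kind used in SwAV-style clustering. It is not taken from the paper: the function name `sinkhorn`, the default of 3 iterations, the order of the two normalization steps, and the final rescaling are all assumptions.

```python
import math


def sinkhorn(Z, n_iters=3):
    """Hypothetical Sinkhorn-Knopp sketch (not the paper's exact Algorithm 1).

    Z: B x K list of lists of positive floats (e.g. exponentiated scores).
    Returns Q: B x K, where every row sums to 1.
    """
    B, K = len(Z), len(Z[0])

    # Start from a matrix whose entries sum to 1.
    total = sum(sum(row) for row in Z)
    Q = [[z / total for z in row] for row in Z]

    for _ in range(n_iters):
        # Column step: give each of the K prototypes total mass 1/K.
        col_sums = [sum(Q[b][k] for b in range(B)) for k in range(K)]
        Q = [[Q[b][k] / (col_sums[k] * K) for k in range(K)] for b in range(B)]

        # Row step: give each of the B samples total mass 1/B.
        row_sums = [sum(row) for row in Q]
        Q = [[q / (row_sums[b] * B) for q in Q[b]] for b in range(B)]

    # Rescale so that each sample's assignment over prototypes sums to 1.
    return [[q * B for q in row] for row in Q]


# Toy input: 4 samples, 2 prototypes, scores exponentiated beforehand.
scores = [[1.0, 0.2], [0.9, 0.1], [0.3, 1.1], [0.8, 0.4]]
Z = [[math.exp(s) for s in row] for row in scores]
Q = sinkhorn(Z, n_iters=3)
```

Alternating the column and row steps pushes the assignments toward equal use of all prototypes, while the last row step guarantees that each sample's assignment is a valid distribution.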
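
The thresholds quoted in the Experiment Setup row can be read as a chain of filters over mined sentence pairs. The sketch below is one possible reading and not the authors' code. The function names `keep_pair` and `top_fraction` are hypothetical, the length ratio is assumed to be an upper bound, the overlap ratio is assumed to be the share of distinct tokens that the two sentences have in common, and the agreement BLEU check (β = 30) is left out because the quoted text does not define it.

```python
def keep_pair(src_tokens, tgt_tokens, l_min=5, l_max=300,
              max_ratio=1.5, max_overlap=0.35):
    """Hypothetical filter; the threshold semantics are assumptions."""
    ls, lt = len(src_tokens), len(tgt_tokens)

    # Length bounds (Lmin, Lmax) applied to both sides.
    if not (l_min <= ls <= l_max and l_min <= lt <= l_max):
        return False

    # Assumed reading of the length ratio: longer / shorter must not exceed it.
    if max(ls, lt) / min(ls, lt) > max_ratio:
        return False

    # Assumed overlap definition: shared distinct tokens over the smaller vocabulary.
    src_set, tgt_set = set(src_tokens), set(tgt_tokens)
    overlap = len(src_set & tgt_set) / min(len(src_set), len(tgt_set))
    return overlap <= max_overlap


def top_fraction(scored_pairs, rho=0.05):
    """Keep the best-scoring fraction rho of (score, pair) tuples."""
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    return ranked[:int(len(ranked) * rho)]


# Toy usage with made-up sentences and scores.
src = "the cat sat on the mat today".split()
tgt = "le chat est assis sur le tapis".split()
accepted = keep_pair(src, tgt)
best = top_fraction([(i / 100, i) for i in range(100)], rho=0.05)
```

Running the cheap length and overlap checks before ranking keeps the sort limited to plausible candidates, but the order of the filters here is a design choice of the sketch.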