Accessing Higher Dimensions for Unsupervised Word Translation
Authors: Sida Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that unsupervised translation can be achieved more easily and robustly than previously thought: less than 80MB and minutes of CPU time are required to achieve over 50% accuracy for English to Finnish, Hungarian, and Chinese translations when trained in the same domain; even under domain mismatch, the method still works fully unsupervised on English News Crawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. |
| Researcher Affiliation | Industry | Sida I. Wang, FAIR, Meta |
| Pseudocode | Yes | Algorithm 1 coocmap self-learning; Algorithm 2 vecmap self-learning (a minimal sketch of the coocmap loop appears after the table) |
| Open Source Code | Yes | code released at https://github.com/facebookresearch/coocmap |
| Open Datasets | Yes | For training data we use Wikipedia (wiki), Europarl (parl), and News Crawl (news)... wiki (https://dumps.wikimedia.org/): Wikipedia downloaded directly from the official dumps (pages-meta-current), text extracted using WikiExtractor (Attardi, 2015)... parl (https://www.statmt.org/europarl/): Europarl (Koehn, 2005)... news (https://data.statmt.org/news-crawl/): News Crawl 2019.es |
| Dataset Splits | No | The paper describes evaluating against the 'full MUSE dictionary' and lists its training data sources, but it does not specify explicit validation splits or a splitting methodology. |
| Hardware Specification | No | The paper mentions 'minutes of CPU time' and discusses computational complexity in terms of FLOPS, but it does not specify any concrete hardware details such as CPU models, GPU models, or specific machine configurations used for running the experiments. |
| Software Dependencies | No | The paper mentions software like 'fasttext' (Bojanowski et al., 2017), 'Huggingface Word Level tokenizer', and 'jieba' (for Chinese segmentation) but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Each point in the scatter plots represents an experiment where a specific amount of data was taken from the head of the file for training co-occurrence matrices and fasttext vectors, with default settings (skipgram, 300 dimensions, more in Appendix B) for fasttext. coocmap uses the same window size as fasttext (m = 5), the same CSLS (k = 10), and the same optimization parameters as vecmap. In the main results, we used default parameters, of which the important ones were skipgram, lr: 0.05, dim: 300, epoch: 5. The learning rate was slowed as 0.1·(d/50)^(-1/2) to account for observed instability in higher dimensions, and the number of epochs was increased to 5·(300/|D|)^(1/2) for data size |D| in MB to run more epochs on smaller data sizes. A worked example of these scaling rules is given after the table. |
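
The Pseudocode row references Algorithm 1 (coocmap self-learning) and Algorithm 2 (vecmap self-learning). As a reading aid, below is a minimal sketch of a coocmap-style self-learning loop, assuming CSLS matching with k = 10 as the paper states; the function names, the mutual-nearest-neighbor dictionary rule, and the fixed iteration count are illustrative simplifications, not the authors' implementation (see https://github.com/facebookresearch/coocmap for the released code).

```python
# Illustrative sketch of a coocmap-style self-learning loop (NOT the
# authors' exact Algorithm 1; names and loop control are assumptions).
import numpy as np

def csls_match(X, Z, k=10):
    """Induce a dictionary from row representations via CSLS (k = 10)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-8)
    S = Xn @ Zn.T  # cosine similarities, shape (n_src, n_tgt)
    # CSLS penalizes hub words by subtracting each word's mean
    # similarity to its k nearest neighbors on the other side.
    r_x = np.sort(S, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_z = np.sort(S, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    C = 2 * S - r_x - r_z
    fwd = C.argmax(axis=1)  # best target for each source word
    bwd = C.argmax(axis=0)  # best source for each target word
    # keep mutual nearest neighbors as the induced dictionary
    return np.array([(i, j) for i, j in enumerate(fwd) if bwd[j] == i])

def coocmap_self_learning(A, B, seed_pairs, iters=20):
    """A, B: source/target association (co-occurrence) matrices."""
    D = np.asarray(seed_pairs)
    for _ in range(iters):
        # Re-represent each word by its association with the columns
        # of the currently matched pairs, then rematch the rows.
        X, Z = A[:, D[:, 0]], B[:, D[:, 1]]
        D = csls_match(X, Z, k=10)
    return D
```

The contrast with the vecmap variant (Algorithm 2) is that vecmap-style self-learning alternates dictionary induction with fitting a linear map between two fixed embedding spaces, whereas in the sketch above the word representations themselves are re-indexed by the current dictionary at every iteration.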
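The learning-rate and epoch formulas in the Experiment Setup row are easy to misread after PDF extraction, so here is a small worked example under my reading of the exponents: negative for the learning rate, which is "slowed" in higher dimensions, and positive for the epoch count, which grows on smaller data. The function name and the rounding are assumptions for illustration.

```python
# Worked example of the quoted scaling rules (exponent signs are my
# reading of the garbled text; function name and rounding are assumed).

def scaled_hparams(dim: int, data_mb: float) -> tuple[float, int]:
    lr = 0.1 * (dim / 50) ** -0.5               # 0.1·(d/50)^(-1/2)
    epochs = round(5 * (300 / data_mb) ** 0.5)  # 5·(300/|D|)^(1/2)
    return lr, epochs

# At dim=300 and 300 MB this roughly recovers the quoted defaults
# (lr ≈ 0.041 vs. the 0.05 default; epochs = 5):
print(scaled_hparams(300, 300))  # (0.0408..., 5)
print(scaled_hparams(300, 30))   # (0.0408..., 16): more epochs on less data
```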