Accessing Higher Dimensions for Unsupervised Word Translation

Authors: Sida Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that unsupervised translation can be achieved more easily and robustly than previously thought: less than 80MB of data and minutes of CPU time are required to achieve over 50% accuracy for English to Finnish, Hungarian, and Chinese translations when trained in the same domain; even under domain mismatch, the method still works fully unsupervised on English News Crawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others.
Researcher Affiliation | Industry | Sida I. Wang, FAIR, Meta
Pseudocode | Yes | Algorithm 1: coocmap self-learning; Algorithm 2: vecmap self-learning (see the self-learning sketch after the table)
Open Source Code | Yes | Code released at https://github.com/facebookresearch/coocmap
Open Datasets | Yes | For training data we use Wikipedia (wiki), Europarl (parl), and News Crawl (news)... wiki (https://dumps.wikimedia.org/): Wikipedia downloaded directly from the official dumps (pages-meta-current), text extracted using WikiExtractor (Attardi, 2015)... parl (https://www.statmt.org/europarl/): Europarl (Koehn, 2005)... news (https://data.statmt.org/news-crawl/): News Crawl 2019.es
Dataset Splits | No | The paper describes using the 'full MUSE dictionary' for evaluating results and mentions training data sources, but it does not specify any explicit validation splits or split methodology.
Hardware Specification | No | The paper mentions 'minutes of CPU time' and discusses computational complexity in terms of FLOPS, but it does not specify concrete hardware details such as CPU or GPU models or the machine configurations used for the experiments.
Software Dependencies | No | The paper mentions software such as fasttext (Bojanowski et al., 2017), the Huggingface WordLevel tokenizer, and jieba (for Chinese segmentation), but does not provide version numbers for any of these dependencies.
Experiment Setup | Yes | Each point in the scatter plots represents an experiment where a specific amount of data was taken from the head of the file for training co-occurrence matrices and fasttext vectors, with default settings for fasttext (skipgram, 300 dimensions; more in Appendix B). coocmap uses the same window size as fasttext (m = 5), the same CSLS (k = 10), and the same optimization parameters as vecmap. In the main results, we used default parameters, where the important ones were skipgram, lr: 0.05, dim: 300, epoch: 5. The learning rate was slowed as 0.1(d/50)^(-1/2) to account for observed instability in higher dimensions, and the epoch count was increased to 5(300/|D|)^(1/2) for data size |D| in MB, to run more epochs on smaller data (see the training sketch after the table).
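
Since the pseudocode row only names the two algorithms, a sketch of their shared shape may help. Both Algorithm 1 (coocmap self-learning) and Algorithm 2 (vecmap self-learning) instantiate the standard self-learning template: induce a dictionary from the current representations with CSLS, refit the alignment on that dictionary, and repeat until the matching stops changing. The Python/NumPy sketch below is a minimal, generic version of that loop, not the released coocmap code: the Procrustes refit is a vecmap-style assumption, and csls_match, self_learn, and the convergence test are illustrative names.

```python
import numpy as np

def csls_match(X, Y, k=10):
    """Induce a dictionary with CSLS: score(x, y) = 2*cos(x, y) - r_X(x) - r_Y(y),
    where r_* is the mean cosine similarity to the k nearest neighbors
    (k = 10, as in the paper's setup)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = Xn @ Yn.T                                # pairwise cosine similarities
    r_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return np.argmax(2 * sim - r_x - r_y, axis=1)  # best target per source word

def self_learn(X, Y, n_iters=20, k=10):
    """Generic self-learning loop (illustrative sketch). X, Y hold the source-
    and target-language word representations, e.g. rows of normalized
    co-occurrence matrices (coocmap) or embeddings (vecmap)."""
    match = csls_match(X, Y, k)                    # initial, possibly noisy dictionary
    for _ in range(n_iters):
        # Refit an orthogonal map W from X to Y on the current matches via
        # Procrustes (a vecmap-style step, assumed here for concreteness).
        U, _, Vt = np.linalg.svd(X.T @ Y[match], full_matrices=False)
        W = U @ Vt
        new_match = csls_match(X @ W, Y, k)        # re-induce the dictionary
        if np.array_equal(new_match, match):       # matching stabilized: converged
            break
        match = new_match
    return match
```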
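
The setup row's dimension- and data-size-dependent schedules can be made concrete. The snippet below sketches them using the fasttext Python bindings (fasttext.train_unsupervised is the library's real entry point); the negative exponent in the learning-rate schedule is reconstructed from "slowed ... in higher dimensions", and the integer rounding of the epoch count is an assumption.

```python
import fasttext

def train_vectors(corpus_path, d=300, data_mb=300.0):
    """Train skipgram vectors with the schedules described in the setup row.
    corpus_path, d, and data_mb are illustrative parameter names."""
    # Learning rate slowed for higher dimensions: lr = 0.1 * (d/50)^(-1/2),
    # which recovers roughly the 0.05 default around d = 300.
    lr = 0.1 * (d / 50) ** -0.5
    # More epochs on smaller data: epoch = 5 * (300/|D|)^(1/2), |D| in MB.
    # fasttext needs an integer epoch count; the rounding is an assumption.
    epoch = max(1, round(5 * (300 / data_mb) ** 0.5))
    return fasttext.train_unsupervised(
        corpus_path,
        model="skipgram",  # default model in the paper's main results
        dim=d,             # dim: 300 by default
        ws=5,              # window size m = 5, shared with coocmap
        lr=lr,
        epoch=epoch,
    )
```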