Point Set Registration for Unsupervised Bilingual Lexicon Induction

Authors: Hailong Cao, Tiejun Zhao

IJCAI 2018

Reproducibility assessment: each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
"In this section, we experimentally test the proposed model in comparison with related methods on the unsupervised bilingual lexicon induction task. We first train source and target word embeddings on source and target monolingual data independently using word2vec."

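As a concrete illustration of this first step, here is a minimal sketch of training independent monolingual embeddings, assuming gensim's Word2Vec as a stand-in for the original word2vec tool; the corpus file names and worker count are illustrative assumptions, and only the 50-dimensional vector size comes from the paper.

```python
# Minimal sketch: train monolingual embeddings independently per language.
# gensim's Word2Vec stands in for the original word2vec tool (assumption);
# corpus paths are hypothetical.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

for lang in ("en", "fr"):
    sentences = LineSentence(f"wiki.{lang}.tok.txt")  # hypothetical tokenized corpus
    model = Word2Vec(
        sentences,
        vector_size=50,  # the paper uses 50-dimensional vectors
        workers=4,       # illustrative; not specified in the paper
    )
    model.wv.save_word2vec_format(f"emb.{lang}.vec")
```
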
Researcher Affiliation: Academia
"Hailong Cao and Tiejun Zhao, Harbin Institute of Technology, caohailong@hit.edu.cn, tjzhao@hit.edu.cn"

Pseudocode: No
The paper describes the approach but does not include pseudocode or an explicit algorithm block.

Open Source Code: No
"The C++ implementation of the CPD algorithm is available at https://github.com/gadomski/cpd. We adapt it for our task." (This refers to a third-party implementation, not the authors' specific adaptation.)

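Since the paper builds on Coherent Point Drift (CPD) but releases no code, the following NumPy sketch shows the core E-step of the standard CPD mixture model, the soft-correspondence computation between the two point sets. It is a generic illustration of the algorithm the linked library implements, not the authors' adapted C++ code, and all names and parameters are illustrative.

```python
import numpy as np

def cpd_e_step(X, TY, sigma2, w=0.1):
    """E-step of Coherent Point Drift (generic sketch, not the paper's code).

    X      : (N, D) fixed point set (e.g., target-language embeddings)
    TY     : (M, D) moving point set under the current transform
    sigma2 : current Gaussian variance
    w      : outlier weight in [0, 1)
    Returns P, an (M, N) matrix of posteriors P(m | x_n).
    """
    N, D = X.shape
    M = TY.shape[0]
    # Squared distances between every moving point and every fixed point.
    dist2 = ((TY[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (M, N)
    G = np.exp(-dist2 / (2.0 * sigma2))
    # Uniform-outlier term of the CPD mixture model.
    c = (2.0 * np.pi * sigma2) ** (D / 2.0) * (w / (1.0 - w)) * (M / N)
    return G / (G.sum(axis=0, keepdims=True) + c)
```
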
Open Datasets: Yes
"The data for training monolingual word embeddings comes from Wikipedia comparable corpora. The French and English text are tokenized and lowercased by scripts from www.statmt.org. All Chinese sentences are segmented by the Stanford Word Segmenter. Table 1 lists the statistics of the final training data. As the ground truth bilingual lexicons for evaluation, we use the lexicons derived by [Upadhyay et al., 2016] using the Open Multilingual WordNet data released by [Bond and Foster, 2013]."

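The preprocessing described above can be approximated as follows. This sketch uses the sacremoses Python port of the Moses tokenizer rather than the original statmt.org Perl scripts (an assumption), and the file names are hypothetical.

```python
# Tokenize and lowercase a monolingual corpus, roughly mirroring the
# statmt.org preprocessing the paper mentions. sacremoses stands in for
# the original Moses Perl scripts (assumption); file names are hypothetical.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="fr")
with open("wiki.fr.txt", encoding="utf-8") as fin, \
     open("wiki.fr.tok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = mt.tokenize(line.strip(), escape=False)
        fout.write(" ".join(tokens).lower() + "\n")
```
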
Dataset Splits: No
The paper does not explicitly state a validation split or a methodology for hyperparameter tuning.

Hardware Specification: No
The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.

Software Dependencies: No
The paper mentions word2vec and the Stanford Word Segmenter but does not specify their version numbers or the version of the linked CPD implementation.

Experiment Setup: Yes
"The dimensionality of all word vectors is 50. The default values are used for all other parameters of word2vec. We retain only the top 10k frequent words for each language. We set the parameter σp in Equation 9 to 100. There is not much difference when σp varies from 100 to 500."

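Putting the stated setup together, here is a minimal sketch of restricting each language to its 10k most frequent words, assuming gensim's KeyedVectors and that the embedding files are sorted by frequency (word2vec's default output order); the file paths are hypothetical.

```python
# Sketch: keep only the 10k most frequent words per language, as in the
# paper's setup. Assumes gensim KeyedVectors and frequency-sorted
# word2vec-format files (word2vec's default); paths are hypothetical.
from gensim.models import KeyedVectors

def load_top_k(path, k=10_000):
    # `limit` keeps only the first k entries of the file.
    return KeyedVectors.load_word2vec_format(path, limit=k)

en = load_top_k("emb.en.vec")
fr = load_top_k("emb.fr.vec")
print(en.vectors.shape)  # expected: (10000, 50)
```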