Point Set Registration for Unsupervised Bilingual Lexicon Induction

Authors: Hailong Cao, Tiejun Zhao

IJCAI 2018

Reproducibility assessment: each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
"In this section, we experimentally test the proposed model in comparison with related methods on the unsupervised bilingual lexicon induction task. We first train source and target word embeddings on source and target monolingual data independently using word2vec."

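As a concrete illustration of this first step, here is a minimal sketch of training independent monolingual embeddings, assuming gensim's Word2Vec as a stand-in for the original word2vec tool; the corpus file names and worker count are illustrative assumptions, and only the 50-dimensional vector size comes from the paper.

```python
# Minimal sketch: train monolingual embeddings independently per language.
# gensim's Word2Vec stands in for the original word2vec tool (assumption);
# corpus paths are hypothetical.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

for lang in ("en", "fr"):
    sentences = LineSentence(f"wiki.{lang}.tok.txt")  # hypothetical tokenized corpus
    model = Word2Vec(
        sentences,
        vector_size=50,  # the paper uses 50-dimensional vectors
        workers=4,       # illustrative; not specified in the paper
    )
    model.wv.save_word2vec_format(f"emb.{lang}.vec")
```
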
Researcher Affiliation: Academia
"Hailong Cao and Tiejun Zhao, Harbin Institute of Technology, caohailong@hit.edu.cn, tjzhao@hit.edu.cn"

Pseudocode: No
The paper describes the approach but does not include pseudocode or an explicit algorithm block.

Open Source Code: No
"The C++ implementation of the CPD algorithm is available at https://github.com/gadomski/cpd. We adapt it for our task." (This refers to a third-party implementation, not the authors' specific adaptation.)

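Since the paper builds on Coherent Point Drift (CPD) but releases no code, the following NumPy sketch shows the core E-step of the standard CPD mixture model, the soft-correspondence computation between the two point sets. It is a generic illustration of the algorithm the linked library implements, not the authors' adapted C++ code, and all names and parameters are illustrative.

```python
import numpy as np

def cpd_e_step(X, TY, sigma2, w=0.1):
    """E-step of Coherent Point Drift (generic sketch, not the paper's code).

    X      : (N, D) fixed point set (e.g., target-language embeddings)
    TY     : (M, D) moving point set under the current transform
    sigma2 : current Gaussian variance
    w      : outlier weight in [0, 1)
    Returns P, an (M, N) matrix of posteriors P(m | x_n).
    """
    N, D = X.shape
    M = TY.shape[0]
    # Squared distances between every moving point and every fixed point.
    dist2 = ((TY[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (M, N)
    G = np.exp(-dist2 / (2.0 * sigma2))
    # Uniform-outlier term of the CPD mixture model.
    c = (2.0 * np.pi * sigma2) ** (D / 2.0) * (w / (1.0 - w)) * (M / N)
    return G / (G.sum(axis=0, keepdims=True) + c)
```
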
Open Datasets: Yes
"The data for training monolingual word embeddings comes from Wikipedia comparable corpora. The French and English text are tokenized and lowercased by scripts from www.statmt.org. All Chinese sentences are segmented by the Stanford Word Segmenter. Table 1 lists the statistics of the final training data. As the ground truth bilingual lexicons for evaluation, we use the lexicons derived by [Upadhyay et al., 2016] using the Open Multilingual WordNet data released by [Bond and Foster, 2013]."

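The preprocessing described above can be approximated as follows. This sketch uses the sacremoses Python port of the Moses tokenizer rather than the original statmt.org Perl scripts (an assumption), and the file names are hypothetical.

```python
# Tokenize and lowercase a monolingual corpus, roughly mirroring the
# statmt.org preprocessing the paper mentions. sacremoses stands in for
# the original Moses Perl scripts (assumption); file names are hypothetical.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="fr")
with open("wiki.fr.txt", encoding="utf-8") as fin, \
     open("wiki.fr.tok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = mt.tokenize(line.strip(), escape=False)
        fout.write(" ".join(tokens).lower() + "\n")
```
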
Dataset Splits: No
The paper does not explicitly state a validation split or a methodology for hyperparameter tuning.

Hardware Specification: No
The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.

Software Dependencies: No
The paper mentions word2vec and the Stanford Word Segmenter but does not specify their version numbers or the version of the linked CPD implementation.

Experiment Setup: Yes
"The dimensionality of all word vectors is 50. The default values are used for all other parameters of word2vec. We retain only the top 10k frequent words for each language. We set the parameter σp in Equation 9 to 100. There is not much difference when σp varies from 100 to 500."

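Putting the stated setup together, here is a minimal sketch of restricting each language to its 10k most frequent words, assuming gensim's KeyedVectors and that the embedding files are sorted by frequency (word2vec's default output order); the file paths are hypothetical.

```python
# Sketch: keep only the 10k most frequent words per language, as in the
# paper's setup. Assumes gensim KeyedVectors and frequency-sorted
# word2vec-format files (word2vec's default); paths are hypothetical.
from gensim.models import KeyedVectors

def load_top_k(path, k=10_000):
    # `limit` keeps only the first k entries of the file.
    return KeyedVectors.load_word2vec_format(path, limit=k)

en = load_top_k("emb.en.vec")
fr = load_top_k("emb.fr.vec")
print(en.vectors.shape)  # expected: (10000, 50)
```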