Bilingual Lexicon Induction from Non-Parallel Data with Minimal Supervision

Authors: Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, Maosong Sun

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we find the matching mechanism to substantially improve the quality of the bilingual vector space, which in turn allows us to induce better bilingual lexica with seeds as few as 10. (Abstract) ... In our experiments, we show that the matching mechanism substantially improves our system, compared to systems that only exploit seeds in superficial ways. (Introduction) ... Experimental Setup ... Results and Discussion
Researcher Affiliation | Academia | 1) State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China; 2) Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China; 3) University of Illinois, Urbana-Champaign
Pseudocode | No | The paper describes its algorithms and models using mathematical equations and textual explanations but does not include any formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | The code of our system is available at http://nlp.csai.tsinghua.edu.cn/~zm/EmbeddingMatching. (Introduction)
Open Datasets | Yes | In our experiments, the tested systems induce bilingual lexica from Wikipedia comparable corpora (http://linguatools.org/tools/corpora/wikipedia-comparable-corpora) on five language pairs: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. (Experimental Setup)
Dataset Splits | Yes | We reserve 10% of each gold standard lexicon for validation, and the remaining 90% for testing. (Experimental Setup)
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory, specific cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions several software tools used for preprocessing (e.g., OpenCC, THULAC, the NLTK toolkit, TreeTagger, MeCab, the LORELEI Language Packs, word2vec) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | The monolingual hyperparameters are set as follows: embedding size D is 40; window size is 5; 5 negative samples; subsampling threshold is 10^-5; initial learning rate is 0.1; 10 training epochs. ... The seed term weight λs has limited impact as long as its value is not too low to tie up bilingual vector spaces, and we set it to 0.01. The matching threshold ϵ can also be set quite liberally as long as it is sufficiently low (in our experiments 0.5)... The matching term weight λm appears to be the most important hyperparameter, so we tune it on the validation set with values in {100, 1000, 10000}. (Hyperparameters)
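The 10%/90% validation/test split quoted in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction only, assuming a shuffled random split; the paper does not publish its exact splitting code, and the function and variable names here are hypothetical.

```python
import random

def split_lexicon(entries, valid_fraction=0.1, seed=0):
    """Shuffle gold-standard lexicon entries and split them into
    (validation, test) lists, reserving `valid_fraction` for validation
    and the remainder for testing."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[:n_valid], shuffled[n_valid:]

# Toy gold lexicon of (source word, target word) pairs.
gold = [("src%d" % i, "tgt%d" % i) for i in range(100)]
valid, test = split_lexicon(gold)
print(len(valid), len(test))  # 10 90
```

With 100 entries and valid_fraction=0.1, this reserves 10 pairs for validation and 90 for testing, mirroring the 10%/90% split described in the paper.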