Bilingual Lexicon Induction from Non-Parallel Data with Minimal Supervision

Authors: Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, Maosong Sun

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we find the matching mechanism to substantially improve the quality of the bilingual vector space, which in turn allows us to induce better bilingual lexica with seeds as few as 10. (Abstract) ... In our experiments, we show that the matching mechanism substantially improves our system, compared to systems that only exploit seeds in superficial ways. (Introduction) ... Experimental Setup ... Results and Discussion
Researcher Affiliation | Academia | 1) State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China; 2) Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China; 3) University of Illinois, Urbana-Champaign
Pseudocode | No | The paper describes its algorithms and models using mathematical equations and textual explanations but does not include any formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | The code of our system is available at http://nlp.csai.tsinghua.edu.cn/~zm/EmbeddingMatching. (Introduction)
Open Datasets | Yes | In our experiments, the tested systems induce bilingual lexica from Wikipedia comparable corpora (http://linguatools.org/tools/corpora/wikipedia-comparable-corpora) on five language pairs: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. (Experimental Setup)
Dataset Splits | Yes | We reserve 10% of each gold standard lexicon for validation, and the remaining 90% for testing. (Experimental Setup)
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory, specific cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions several software tools used for preprocessing (e.g., OpenCC, THULAC, the NLTK toolkit, TreeTagger, MeCab, the LORELEI Language Packs, word2vec) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | The monolingual hyperparameters are set as follows: embedding size D is 40; window size is 5; 5 negative samples; subsampling threshold is 10^-5; initial learning rate is 0.1; 10 training epochs. ... The seed term weight λs has limited impact as long as its value is not too low to tie up bilingual vector spaces, and we set it to 0.01. The matching threshold ϵ can also be set quite liberally as long as it is sufficiently low (in our experiments 0.5)... The matching term weight λm appears to be the most important hyperparameter, so we tune it on the validation set with values in {100, 1000, 10000}. (Hyperparameters)
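The 10%/90% validation/test split quoted in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction only, assuming a shuffled random split; the paper does not publish its exact splitting code, and the function and variable names here are hypothetical.

```python
import random

def split_lexicon(entries, valid_fraction=0.1, seed=0):
    """Shuffle gold-standard lexicon entries and split them into
    (validation, test) lists, reserving `valid_fraction` for validation
    and the remainder for testing."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[:n_valid], shuffled[n_valid:]

# Toy gold lexicon of (source word, target word) pairs.
gold = [("src%d" % i, "tgt%d" % i) for i in range(100)]
valid, test = split_lexicon(gold)
print(len(valid), len(test))  # 10 90
```

With 100 entries and valid_fraction=0.1, this reserves 10 pairs for validation and 90 for testing, mirroring the 10%/90% split described in the paper.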