Fast PMI-Based Word Embedding with Efficient Use of Unobserved Patterns

Authors: Behrouz Haji Soleimani, Stan Matwin (pp. 7031-7038)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have trained various word embedding algorithms on articles of Wikipedia with 2.1 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We have trained our algorithm as well as several others on the articles of Wikipedia and compared the quality of embeddings on various word similarity and analogy tasks. Results show that our algorithm outperforms the state-of-the-art in most of the tasks. Table 1 compares 14 algorithms on 8 word similarity datasets. The numbers in the table are Pearson's correlation between the rankings provided by the algorithms and the human-assigned rankings. (An illustrative sketch of this evaluation appears after the table.)
Researcher Affiliation | Academia | 1) Institute for Big Data Analytics, Faculty of Computer Science, Dalhousie University, Halifax NS, Canada; 2) Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland. behrouz.hajisoleimani@dal.ca, stan@cs.dal.ca
Pseudocode | No | The paper describes the algorithm steps in paragraph form, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | KUBWE is implemented in C using the OpenMP parallel computing library and the source code can be found on GitHub: https://github.com/behrouzhs/kubwe
Open Datasets | Yes | We have used all the articles of English Wikipedia (dump of March 2016) as the training corpus, which has around 2.1 billion tokens after applying a few basic preprocessing steps.
Dataset Splits | No | The paper mentions using a 'training corpus' and evaluates on word similarity and word analogy tasks (test sets), but it does not specify any explicit training/validation/test splits or cross-validation setup for the training data.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for the experiments (e.g., GPU/CPU models, memory, or cloud instances).
Software Dependencies | No | KUBWE is implemented in C using the OpenMP parallel computing library and the source code can be found on GitHub. While the paper mentions C and OpenMP, it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In all our experiments, we used α = 0.75, which is known to be a good smoothing factor (Mikolov et al. 2013b). Our proposed algorithm, KUBWE, is trained with p = 13, and the fast KUBWE is trained with k = 3000. GloVe is trained with its recommended parameter setting (i.e., xmax = 100). fastText is trained with the recommended parameter settings, which consider character n-grams of length 3 to 6. CBOW and SGNS are trained with negative sampling set to 5 and 10. (An illustrative sketch of PMI with this smoothing factor appears after the table.)
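
The Research Type row notes that the paper's Table 1 reports Pearson's correlation between embedding-based similarity scores and human ratings. Below is a minimal sketch, not taken from the paper or its repository, of how such a word-similarity evaluation is typically computed; the function name, argument layout, and use of scipy are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): scoring a word-similarity dataset
# by correlating embedding-based cosine similarities with human judgments.
import numpy as np
from scipy.stats import pearsonr

def evaluate_similarity(embeddings, word_pairs, human_scores):
    """embeddings: dict mapping word -> numpy vector.
    word_pairs: list of (w1, w2) tuples.
    human_scores: gold similarity ratings aligned with word_pairs."""
    model_scores, gold = [], []
    for (w1, w2), score in zip(word_pairs, human_scores):
        # Skip pairs containing out-of-vocabulary words, a common convention.
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            model_scores.append(cos)
            gold.append(score)
    # Pearson correlation between model similarities and human ratings,
    # the metric reported in the paper's Table 1.
    r, _ = pearsonr(model_scores, gold)
    return r
```

Skipping out-of-vocabulary pairs, as done here, mirrors common practice in word-similarity benchmarks, although some evaluation protocols instead penalize missing pairs; the paper does not state which convention it follows.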
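
The Experiment Setup row mentions the smoothing factor α = 0.75. For reference, the sketch below shows the standard context-distribution smoothing applied when building a PMI matrix (the formulation popularized alongside SGNS). It is an assumed illustration, not the paper's KUBWE implementation, which specifically differs in how it treats unobserved word-context pairs.

```python
# Assumed formulation (not the paper's KUBWE code): PMI matrix with
# context-distribution smoothing, using the alpha = 0.75 factor cited above.
import numpy as np

def smoothed_pmi(cooc, alpha=0.75):
    """cooc: dense word-by-context co-occurrence count matrix (numpy array).
    Returns PMI(w, c) = log( P(w, c) / (P(w) * P_alpha(c)) ), where P_alpha(c)
    is the context distribution raised to the power alpha and renormalized."""
    total = cooc.sum()
    p_wc = cooc / total                               # joint probabilities
    p_w = cooc.sum(axis=1, keepdims=True) / total     # word marginals
    c_alpha = cooc.sum(axis=0) ** alpha
    p_c_alpha = c_alpha / c_alpha.sum()               # smoothed context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha))
    # Unobserved pairs yield -inf here; KUBWE's contribution is precisely a more
    # efficient treatment of these unobserved patterns, so zeroing them (as in
    # positive-PMI baselines) is only a placeholder convention in this sketch.
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi
```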