Fast PMI-Based Word Embedding with Efficient Use of Unobserved Patterns
Authors: Behrouz Haji Soleimani, Stan Matwin (pp. 7031-7038)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have trained various word embedding algorithms on articles of Wikipedia with 2.1 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We have trained our algorithm as well as several others on the articles of Wikipedia and compared the quality of embeddings on various word similarity and analogy tasks. Results show that our algorithm outperforms the state-of-the-art in most of the tasks. Table 1 compares 14 algorithms on 8 word similarity datasets. The numbers in the table are Pearson's correlation between the rankings provided by the algorithms and the rankings of the human scoring. (A minimal sketch of this rank-correlation evaluation follows the table.) |
| Researcher Affiliation | Academia | 1Institute for Big Data Analytics, Faculty of Computer Science, Dalhousie University, Halifax NS, Canada 2Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland behrouz.hajisoleimani@dal.ca, stan@cs.dal.ca |
| Pseudocode | No | The paper describes the algorithm steps in paragraph form, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | KUBWE is implemented in C using the OpenMP parallel computing library and the source code can be found on GitHub: https://github.com/behrouzhs/kubwe |
| Open Datasets | Yes | We have used all the articles of English Wikipedia (dump of March 2016) as the training corpus which has around 2.1 billion tokens after applying a few basic preprocessing steps. |
| Dataset Splits | No | The paper mentions using a 'training corpus' and evaluates on 'word similarity and word analogy tasks' (test sets), but it does not specify any explicit training/validation/test dataset splits or cross-validation setup for the general dataset used for training. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, or cloud instances). |
| Software Dependencies | No | KUBWE is implemented in C using the OpenMP parallel computing library and the source code can be found on GitHub. While it mentions C and OpenMP, it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In all our experiments, we used α = 0.75 which is known to be a good smoothing factor (Mikolov et al. 2013b). Our proposed algorithm, KUBWE, is trained with p = 13, and the fast KUBWE is trained with k = 3000. GloVe is trained with its recommended parameter setting (i.e. x_max = 100). fastText is trained with the recommended parameter settings that consider character n-grams of length 3 to 6. CBOW and SGNS are trained with negative sampling set to 5 and 10. (A minimal sketch of the smoothed PMI computation follows the table.) |
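The Research Type row notes that Table 1 of the paper reports the correlation between each algorithm's similarity ranking and the human ranking. The sketch below illustrates that evaluation protocol under stated assumptions: cosine similarity as the model's similarity score, Spearman's ρ (i.e. Pearson correlation computed on ranks) as the correlation, and out-of-vocabulary pairs silently skipped. The function name and dataset format are hypothetical and not taken from the paper's code.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(embeddings, vocab, pairs):
    """Correlate model cosine similarities with human judgments.

    `pairs` is a list of (word1, word2, human_score) tuples from a
    word similarity dataset (e.g. WordSim-353).
    """
    model_sims, human_sims = [], []
    for w1, w2, score in pairs:
        if w1 not in vocab or w2 not in vocab:
            continue  # illustrative choice: skip out-of-vocabulary pairs
        v1, v2 = embeddings[vocab[w1]], embeddings[vocab[w2]]
        cos = float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_sims.append(cos)
        human_sims.append(score)
    # Rank correlation between model similarities and human scores.
    rho, _ = spearmanr(model_sims, human_sims)
    return rho
```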
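The Experiment Setup row mentions the smoothing factor α = 0.75 used when computing PMI. The following is a minimal sketch of context-distribution smoothing on a dense co-occurrence matrix, assuming unobserved pairs are simply left at zero; the paper's KUBWE method handles unobserved patterns differently, so this serves only as a reference for the α parameter, not as the authors' algorithm.

```python
import numpy as np

def smoothed_pmi(cooc, alpha=0.75):
    """PMI from a word-by-context co-occurrence matrix with context
    distribution smoothing: context counts are raised to the power
    alpha (0.75 here), which dampens the effect of rare contexts."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1) / total              # word marginal P(w)
    smoothed = cooc.sum(axis=0) ** alpha        # smoothed context counts
    p_c = smoothed / smoothed.sum()             # smoothed marginal P_alpha(c)
    p_wc = cooc / total                         # joint P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / np.outer(p_w, p_c))
    pmi[~np.isfinite(pmi)] = 0.0                # unobserved pairs left at zero
    return pmi
```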