Neural Word Embedding as Implicit Matrix Factorization
Authors: Omer Levy, Yoav Goldberg
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the word representations on four datasets, covering word similarity and relational analogy tasks. We used two datasets to evaluate pairwise word similarity: Finkelstein et al.'s WordSim353 [13] and Bruni et al.'s MEN [4]. These datasets contain word pairs together with human-assigned similarity scores. |
| Researcher Affiliation | Academia | Omer Levy Department of Computer Science Bar-Ilan University omerlevy@gmail.com Yoav Goldberg Department of Computer Science Bar-Ilan University yoav.goldberg@gmail.com |
| Pseudocode | No | The paper describes methods such as SGNS and SVD but does not present them in a structured pseudocode or algorithm block (an illustrative sketch follows the table below). |
| Open Source Code | Yes | To train the SGNS models, we used a modified version of word2vec which receives a sequence of pre-extracted word-context pairs [18]. ... http://www.bitbucket.org/yoavgo/word2vecf |
| Open Datasets | Yes | All models were trained on English Wikipedia, pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 million sentences, spanning 1.5 billion tokens. |
| Dataset Splits | No | No explicit training/validation/test dataset splits (e.g., percentages, sample counts, or cross-validation setup) are provided for the English Wikipedia corpus used for training. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned. |
| Software Dependencies | No | The paper mentions using 'a modified version of word2vec' but does not provide specific version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | All models were derived using a window of 2 tokens to each side of the focus word, ignoring words that appeared less than 100 times in the corpus, resulting in vocabularies of 189,533 terms for both words and contexts. ... We experimented with three values of k (number of negative samples in SGNS, shift parameter in PMI-based methods): 1, 5, 15. |
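
Since the paper provides no pseudocode (see the "Pseudocode" row above), here is a minimal, hedged sketch of its central construction: SGNS with k negative samples implicitly factorizes the word-context PMI matrix shifted by log k, and clipping negative cells at zero gives the shifted positive PMI (SPPMI) matrix, which can then be factorized explicitly with truncated SVD. The toy corpus, function names, and dimensionality below are illustrative assumptions, not the authors' word2vecf code.

```python
# Sketch of the paper's SPPMI + SVD construction (illustrative; not the
# authors' word2vecf code). SGNS with k negative samples implicitly
# factorizes PMI(w, c) - log k; clipping at zero yields the SPPMI matrix.
import numpy as np
from collections import Counter

def build_sppmi(pairs, k=5):
    """pairs: list of (word, context) tuples; returns the SPPMI matrix."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    D = len(pairs)  # |D|: total number of observed word-context pairs

    words, ctxs = sorted(word_counts), sorted(ctx_counts)
    w_idx = {w: i for i, w in enumerate(words)}
    c_idx = {c: j for j, c in enumerate(ctxs)}

    M = np.zeros((len(words), len(ctxs)))
    for (w, c), n in pair_counts.items():
        # PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) )
        pmi = np.log(n * D / (word_counts[w] * ctx_counts[c]))
        # SPPMI_k(w, c) = max(PMI(w, c) - log k, 0)
        M[w_idx[w], c_idx[c]] = max(pmi - np.log(k), 0.0)
    return M, words

def svd_embeddings(M, dim):
    """Truncated SVD with the symmetric weighting W = U_d * sqrt(Sigma_d),
    which the paper reports working well on similarity tasks."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])

# Toy usage: an invented two-sentence corpus with the paper's window of 2.
corpus = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
pairs = [(s[i], s[j]) for s in corpus for i in range(len(s))
         for j in range(max(0, i - 2), min(len(s), i + 3)) if j != i]
M, vocab = build_sppmi(pairs, k=5)
W = svd_embeddings(M, dim=2)  # one 2-d vector per vocabulary word
```

In the paper's actual experiments, k ∈ {1, 5, 15} plays this dual role, serving as the negative-sample count for SGNS and as the PMI shift for the SVD-based methods, matching the "Experiment Setup" row above.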