Distributed Negative Sampling for Word Embeddings
Authors: Stergios Stergiou, Zygimantas Straznickas, Rolina Wu, Kostas Tsioutsiouliklis
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. Training time per epoch is 2 hours for typical hyperparameters. To the best of our knowledge, this is the first work that has been shown to scale to corpora of more than 1 trillion words and out-of-memory dictionaries. We collect experimental results on a cluster of commodity nodes whose configuration is depicted on Table 1. We establish the quality of the embeddings obtained by our algorithms on a composite corpus comprising two news corpora, the 1 Billion Word Language Model Benchmark, the UMBC Web Base corpus and Wikipedia. We report quality results on the Google Analogy evaluation dataset (Mikolov et al. 2013) for the BNS, SNS and TNS algorithms, as well as the reference word2vec implementation, on Table 2. |
| Researcher Affiliation | Collaboration | Stergios Stergiou Yahoo Research stergios@yahoo-inc.com Zygimantas Straznickas MIT zygi@mit.edu Rolina Wu University of Waterloo rolina.wu@uwaterloo.ca Kostas Tsioutsiouliklis Yahoo Research kostas@yahoo-inc.com |
| Pseudocode | Yes | Algorithm 1 SGNS Word2Vec; Algorithm 2 Alias Method; Algorithm 3 Hierarchical Sampling; Algorithm 4 Baseline Negative Sampling; Algorithm 5 Single Negative Sampling; Algorithm 6 Target Negative Sampling. (A minimal alias-method sketch follows the table.) |
| Open Source Code | No | The paper mentions 'Chronos is a proprietary in-memory / secondary-storage hybrid graph processing system' and references other open-source implementations (Gensim, TensorFlow, Medallia, Deeplearning4j, MLLib) but does not provide a link or statement that their own implementation's code is open source. |
| Open Datasets | Yes | We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. We establish the quality of the embeddings obtained by our algorithms on a composite corpus comprising two news corpora, the 1 Billion Word Language Model Benchmark (Chelba et al. 2013), the UMBC Web Base corpus (Han et al. 2013) and Wikipedia. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, nor does it refer to predefined splits with citations for the main corpora used for training. It mentions evaluation on the Google Analogy Dataset, which typically functions as a test set. |
| Hardware Specification | Yes | We collect experimental results on a cluster of commodity nodes whose configuration is depicted on Table 1. Table 1: Cluster Node Configuration. CPU: 2x Intel Xeon E5-2620; Frequency: 2.5 GHz (max); RAM: 64 GB; Memory Bandwidth: 42.6 GB/s (max); Network: 10 Gbps Ethernet. |
| Software Dependencies | No | The paper states their system is developed in 'C++11 on top of Hadoop Map Reduce' but does not provide specific version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | Accuracy Results on Google Analogy Dataset for Single-Epoch Trained Embeddings on the Composite Corpus for d = 300, win = 5, neg = 5, t = 10^-5, min_count = 10 and threads = 20. 350 partitions are used for BNS, SNS and TNS. (A hedged mapping of these settings onto a common Word2Vec API follows the table.) |
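
The paper's Algorithm 2 is the standard alias method for O(1) sampling from the noise distribution used to draw negatives. Below is a minimal, single-machine sketch of that technique, not the authors' distributed C++/Hadoop implementation; the toy word counts and the 0.75 smoothing exponent are illustrative assumptions following common word2vec practice.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: preprocess a discrete distribution (probs sums to 1)
    into two tables that allow O(1) sampling per draw."""
    n = len(probs)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # move the leftover mass back to bucket l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # remaining buckets are ~1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one index in O(1): pick a bucket uniformly, then keep it or take its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Illustrative noise distribution: unigram counts raised to the 0.75 power,
# as commonly done for word2vec negative sampling.
counts = {"the": 50, "cat": 5, "sat": 3, "mat": 2}
weights = [c ** 0.75 for c in counts.values()]
total = sum(weights)
prob, alias = build_alias_table([w / total for w in weights])
words = list(counts)
negatives = [words[alias_draw(prob, alias)] for _ in range(5)]  # 5 negative samples
```

The O(1) per-draw cost is what makes the alias table attractive once the vocabulary no longer fits comfortably in memory, which is the regime the paper targets.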
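
For orientation only, the reported hyperparameters map directly onto the Word2Vec API of gensim 4.x; this is a hedged illustration, not the configuration the authors ran (their results come from their own distributed system and the reference word2vec C code), and the corpus path is hypothetical.

```python
from gensim.models import Word2Vec

# Assumed gensim >= 4.0; "composite_corpus.txt" is a hypothetical LineSentence-format
# file (one whitespace-tokenized sentence per line) standing in for the composite corpus.
model = Word2Vec(
    corpus_file="composite_corpus.txt",
    vector_size=300,   # d = 300
    window=5,          # win = 5
    negative=5,        # neg = 5
    sample=1e-5,       # t = 10^-5 (frequent-word subsampling threshold)
    min_count=10,      # min_count = 10
    workers=20,        # threads = 20
    sg=1,              # skip-gram with negative sampling (SGNS)
    epochs=1,          # single-epoch training, matching Table 2
)
model.wv.save_word2vec_format("embeddings.txt")  # export vectors for analogy evaluation
```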