Distributed Negative Sampling for Word Embeddings
Authors: Stergios Stergiou, Zygimantas Straznickas, Rolina Wu, Kostas Tsioutsiouliklis
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. Training time per epoch is 2 hours for typical hyperparameters. To the best of our knowledge, this is the first work that has been shown to scale to corpora of more than 1 trillion words and out-of-memory dictionaries. We collect experimental results on a cluster of commodity nodes whose configuration is depicted on Table 1. We establish the quality of the embeddings obtained by our algorithms on a composite corpus comprising two news corpora, the 1 Billion Word Language Model Benchmark, the UMBC Web Base corpus and Wikipedia. We report quality results on the Google Analogy evaluation dataset (Mikolov et al. 2013) for the BNS, SNS and TNS algorithms, as well as the reference word2vec implementation, on Table 2. |
| Researcher Affiliation | Collaboration | Stergios Stergiou Yahoo Research stergios@yahoo-inc.com Zygimantas Straznickas MIT zygi@mit.edu Rolina Wu University of Waterloo rolina.wu@uwaterloo.ca Kostas Tsioutsiouliklis Yahoo Research kostas@yahoo-inc.com |
| Pseudocode | Yes | Algorithm 1 SGNS Word2Vec; Algorithm 2 Alias Method; Algorithm 3 Hierarchical Sampling; Algorithm 4 Baseline Negative Sampling; Algorithm 5 Single Negative Sampling; Algorithm 6 Target Negative Sampling. (A minimal alias-method sketch follows the table.) |
| Open Source Code | No | The paper mentions 'Chronos is a proprietary in-memory / secondary-storage hybrid graph processing system' and references other open-source implementations (Gensim, TensorFlow, Medallia, Deeplearning4j, MLLib) but does not provide a link or statement that their own implementation's code is open source. |
| Open Datasets | Yes | We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. We establish the quality of the embeddings obtained by our algorithms on a composite corpus comprising two news corpora, the 1 Billion Word Language Model Benchmark (Chelba et al. 2013), the UMBC Web Base corpus (Han et al. 2013) and Wikipedia. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, nor does it refer to predefined splits with citations for the main corpora used for training. It mentions evaluation on the Google Analogy Dataset, which typically functions as a test set. |
| Hardware Specification | Yes | We collect experimental results on a cluster of commodity nodes whose configuration is depicted on Table 1. Table 1: Cluster Node Configuration. CPU: 2x Intel Xeon E5-2620; Frequency: 2.5 GHz (max); RAM: 64 GB; Memory Bandwidth: 42.6 GB/s (max); Network: 10 Gbps Ethernet. |
| Software Dependencies | No | The paper states their system is developed in 'C++11 on top of Hadoop Map Reduce' but does not provide specific version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | Accuracy Results on Google Analogy Dataset for Single-Epoch Trained Embeddings on the Composite Corpus for d = 300, win = 5, neg = 5, t = 10^-5, min_count = 10 and threads = 20. 350 partitions are used for BNS, SNS and TNS. (A hedged mapping of these settings onto a common Word2Vec API follows the table.) |
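
The paper's Algorithm 2 is the standard alias method for O(1) sampling from the noise distribution used to draw negatives. Below is a minimal, single-machine sketch of that technique, not the authors' distributed C++/Hadoop implementation; the toy word counts and the 0.75 smoothing exponent are illustrative assumptions following common word2vec practice.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: preprocess a discrete distribution (probs sums to 1)
    into two tables that allow O(1) sampling per draw."""
    n = len(probs)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # move the leftover mass back to bucket l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # remaining buckets are ~1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one index in O(1): pick a bucket uniformly, then keep it or take its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Illustrative noise distribution: unigram counts raised to the 0.75 power,
# as commonly done for word2vec negative sampling.
counts = {"the": 50, "cat": 5, "sat": 3, "mat": 2}
weights = [c ** 0.75 for c in counts.values()]
total = sum(weights)
prob, alias = build_alias_table([w / total for w in weights])
words = list(counts)
negatives = [words[alias_draw(prob, alias)] for _ in range(5)]  # 5 negative samples
```

The O(1) per-draw cost is what makes the alias table attractive once the vocabulary no longer fits comfortably in memory, which is the regime the paper targets.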
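
For orientation only, the reported hyperparameters map directly onto the Word2Vec API of gensim 4.x; this is a hedged illustration, not the configuration the authors ran (their results come from their own distributed system and the reference word2vec C code), and the corpus path is hypothetical.

```python
from gensim.models import Word2Vec

# Assumed gensim >= 4.0; "composite_corpus.txt" is a hypothetical LineSentence-format
# file (one whitespace-tokenized sentence per line) standing in for the composite corpus.
model = Word2Vec(
    corpus_file="composite_corpus.txt",
    vector_size=300,   # d = 300
    window=5,          # win = 5
    negative=5,        # neg = 5
    sample=1e-5,       # t = 10^-5 (frequent-word subsampling threshold)
    min_count=10,      # min_count = 10
    workers=20,        # threads = 20
    sg=1,              # skip-gram with negative sampling (SGNS)
    epochs=1,          # single-epoch training, matching Table 2
)
model.wv.save_word2vec_format("embeddings.txt")  # export vectors for analogy evaluation
```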