On the Dimensionality of Word Embedding

Authors: Zi Yin, Yuanyuan Shen

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All our experiments use the Text8 corpus [Mahoney, 2011], a standard benchmark corpus used for various natural language tasks. We perform this procedure and cross-validate the results with grid search for LSA, skip-gram Word2Vec and GloVe on an English corpus. Figure 1b and 1c display the performances (measured by the correlation between vector cosine similarity and human labels) of word embeddings of various dimensionalities from the PPMI LSA algorithm, evaluated on two word correlation tests: WordSim353 [Finkelstein et al., 2001] and MTurk771 [Halawi et al., 2012]. (A sketch of this dimensionality sweep appears below the table.)
Researcher Affiliation | Collaboration | Zi Yin (Stanford University, s0960974@gmail.com); Yuanyuan Shen (Microsoft Corp. & Stanford University, Yuanyuan.Shen@microsoft.com)
Pseudocode | No | The paper presents mathematical derivations and concepts but does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code can be found on GitHub: https://github.com/ziyin-dl/word-embedding-dimensionality-selection
Open Datasets | Yes | All our experiments use the Text8 corpus [Mahoney, 2011], a standard benchmark corpus used for various natural language tasks. (A download sketch appears below the table.)
Dataset Splits | No | The paper mentions cross-validating results with grid search but does not specify explicit training/validation/test splits of its primary Text8 corpus, nor does it detail how a validation set was used during training.
Hardware Specification | No | The paper does not provide any specifics about the hardware used to run the experiments (e.g., CPU or GPU models, memory).
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | No | The paper does not provide specific experimental setup details such as the hyperparameters (learning rate, batch size, epochs) used to train the LSA, skip-gram, or GloVe embeddings. (An illustrative configuration is sketched below the table.)
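
To make the sweep described in the Research Type row concrete, here is a minimal sketch (not the authors' released pipeline, which is linked above): it trains skip-gram Word2Vec on Text8 at several dimensionalities and scores each model by the Spearman correlation between its cosine similarities and human similarity labels. It assumes gensim and scipy are installed; the WordSim353 file path is a placeholder.

```python
import gensim.downloader as api
from gensim.models import Word2Vec
from scipy.stats import spearmanr

corpus = list(api.load("text8"))  # Text8 as a list of token lists

def load_word_pairs(path):
    # Minimal TSV reader: "word1<TAB>word2<TAB>human score" per line.
    with open(path) as f:
        return [(w1, w2, float(s)) for w1, w2, s in (line.split("\t") for line in f)]

def similarity_score(model, pairs):
    # Spearman correlation between cosine similarity and human labels,
    # restricted to pairs whose words are both in the vocabulary.
    sims, golds = [], []
    for w1, w2, human in pairs:
        if w1 in model.wv and w2 in model.wv:
            sims.append(float(model.wv.similarity(w1, w2)))
            golds.append(human)
    return spearmanr(sims, golds).correlation

pairs = load_word_pairs("wordsim353.tsv")  # placeholder path

for dim in (50, 100, 200, 300, 400):  # grid over embedding dimensionality
    model = Word2Vec(corpus, vector_size=dim, sg=1)  # sg=1 selects skip-gram
    print(dim, similarity_score(model, pairs))
```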
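
The Open Datasets row is easy to verify: Text8 is freely downloadable. A minimal sketch, assuming the standard mattmahoney.net mirror (the source behind the [Mahoney, 2011] citation) is reachable:

```python
import io
import urllib.request
import zipfile

# Fetch and unpack Text8 (Mahoney, 2011) from its standard mirror.
URL = "http://mattmahoney.net/dc/text8.zip"
raw = urllib.request.urlopen(URL).read()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    text = zf.read("text8").decode("ascii")  # one long, lowercase, space-separated string

tokens = text.split()
print(f"{len(tokens):,} tokens")  # roughly 17 million tokens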
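
Finally, since the Experiment Setup row notes that no training hyperparameters are reported, the configuration below shows what a complete specification would have to include. Every value is an assumed common default, not a setting taken from the paper.

```python
from gensim.models import Word2Vec

# Illustrative skip-gram setup; all values are assumptions, not the paper's.
model = Word2Vec(
    corpus,           # tokenized Text8, e.g. from the sketches above
    vector_size=200,  # embedding dimensionality, the quantity the paper studies
    window=5,         # context window size (assumed)
    min_count=5,      # discard words rarer than this (assumed)
    sg=1,             # 1 = skip-gram, one of the three algorithms compared
    epochs=5,         # passes over the corpus (assumed)
)
model.save("text8_skipgram_200.model")
```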