Distributed Representations of Sentences and Documents
Authors: Quoc Le, Tomas Mikolov
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that Paragraph Vectors outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks. We perform experiments to better understand the behavior of the paragraph vectors. To achieve this, we benchmark Paragraph Vector on two text understanding problems that require fixed-length vector representations of paragraphs: sentiment analysis and information retrieval. |
| Researcher Affiliation | Industry | Quoc Le (QVL@GOOGLE.COM), Tomas Mikolov (TMIKOLOV@GOOGLE.COM), Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043 |
| Pseudocode | No | The paper describes algorithms and uses figures to illustrate frameworks, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions 'A particular implementation of neural network based algorithm for training the word vectors is available at code.google.com/p/word2vec/ (Mikolov et al., 2013a).' However, this refers to a tool for word vectors (word2vec) that inspired their work, not to the open-source code for their proposed Paragraph Vector method itself. There is no explicit statement or link providing access to the Paragraph Vector source code. |
| Open Datasets | Yes | For sentiment analysis, we use two datasets: Stanford sentiment treebank dataset (Socher et al., 2013b) and IMDB dataset (Maas et al., 2011). The datasets can be downloaded at http://nlp.stanford.edu/sentiment/ (Stanford Sentiment Treebank) and http://ai.stanford.edu/~amaas/data/sentiment/index.html (IMDB dataset). |
| Dataset Splits | Yes | The dataset consists of three sets: 8544 sentences for training, 2210 sentences for test and 1101 sentences for validation (or development). (Stanford Sentiment Treebank) The 100,000 movie reviews are divided into three datasets: 25,000 labeled training instances, 25,000 labeled test instances and 50,000 unlabeled training instances. (IMDB dataset) |
| Hardware Specification | No | On average, our implementation takes 30 minutes to compute the paragraph vectors of the IMDB test set, using a 16 core machine (25,000 documents, each document on average has 230 words). This specifies the number of cores but not specific CPU models, GPU details, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions stochastic gradient descent, backpropagation, hierarchical softmax, logistic regression, support vector machines, and K-means, but does not provide specific version numbers for any of the software libraries or frameworks used. |
| Experiment Setup | Yes | In our experiments, we cross validate the window size using the validation set, and the optimal window size is 8. The vector presented to the classifier is a concatenation of two vectors, one from PV-DBOW and one from PV-DM. In PV-DBOW, the learned vector representations have 400 dimensions. In PV-DM, the learned vector representations have 400 dimensions for both words and paragraphs. To predict the 8-th word, we concatenate the paragraph vectors and 7 word vectors. Special characters such as ,.!? are treated as a normal word. If the paragraph has less than 9 words, we pre-pad with a special NULL word symbol. (Stanford Treebank) In particular, we cross validate the window size, and the optimal window size is 10 words. The vector presented to the classifier is a concatenation of two vectors, one from PV-DBOW and one from PV-DM. In PV-DBOW, the learned vector representations have 400 dimensions. In PV-DM, the learned vector representations have 400 dimensions for both words and documents. (IMDB) |
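Since the paper releases no code, the PV-DM input construction quoted above can only be reconstructed from its description. The sketch below illustrates the Stanford Treebank setup: a window of 8 (7 context word vectors plus the paragraph vector, concatenated, to predict the 8th word), 400-dimensional vectors, and NULL pre-padding for short paragraphs. The helper names (`pvdm_contexts`, `pvdm_input`) and the exact padding rule are assumptions, not the authors' implementation.

```python
import numpy as np

DIM = 400             # vector size for both words and paragraphs (per the paper)
WINDOW = 8            # optimal window size on the Stanford Treebank validation set
CONTEXT = WINDOW - 1  # 7 word vectors are concatenated with the paragraph vector

def pvdm_contexts(tokens, null_token="<NULL>"):
    """Yield (context_words, target_word) pairs for PV-DM.

    Hypothetical helper: the padding scheme is reconstructed from the
    paper's description ("pre-pad with a special NULL word symbol"),
    not taken from released code.
    """
    if len(tokens) < WINDOW:
        tokens = [null_token] * (WINDOW - len(tokens)) + tokens
    for i in range(CONTEXT, len(tokens)):
        yield tokens[i - CONTEXT:i], tokens[i]

def pvdm_input(paragraph_vec, word_vecs, context_words):
    """Concatenate the paragraph vector with the 7 context word vectors,
    giving a (1 + 7) * 400 = 3200-dimensional input to the predictor."""
    return np.concatenate([paragraph_vec] + [word_vecs[w] for w in context_words])
```

For example, an 8-token paragraph yields a single training pair (first 7 tokens as context, 8th as target), while a 1-token paragraph is pre-padded with seven NULL symbols before its only word becomes a target.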