Efficient Vector Representation for Documents through Corruption
Authors: Minmin Chen
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Doc2VecC on a sentiment analysis task, a document classification task and a semantic relatedness task, along with several document representation learning algorithms. |
| Researcher Affiliation | Industry | Minmin Chen, Criteo Research, Palo Alto, CA 94301, USA; m.chen@criteo.com |
| Pseudocode | No | The paper provides mathematical derivations and descriptions but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All experiments can be reproduced using the code available at https://github.com/mchen24/iclr2017 |
| Open Datasets | Yes | For sentiment analysis, we use the IMDB movie review dataset. It comes with predefined train/test split (Maas et al., 2011)... We test Doc2VecC on the SemEval-2014 Task 1: semantic relatedness SICK dataset (Marelli et al., 2014). |
| Dataset Splits | Yes | The hyper-parameters are tuned on a validation set subsampled from the training set. ... The set is split into a training set of 4,500 instances, a validation set of 500, and a test set of 4,927. |
| Hardware Specification | Yes | The experiments were conducted on a desktop with an Intel i7 2.2 GHz CPU. |
| Software Dependencies | No | The paper mentions using a 'linear support vector machine (SVM)' and 't-SNE' for analysis but does not provide specific version numbers for any software libraries or tools. |
| Experiment Setup | Yes | We remove words that appear less than 10 times in the training set... A vector of 4800 dimensions... is generated for each document. In comparison, all the other algorithms produce a vector representation of size 100. ...we used q = 0.9 throughout the experiments. ... We used a cutoff of 100 in this experiment. ...we applied the trick of subsampling of frequent words introduced in (Mikolov & Dean, 2013)... Given the sentence embeddings, we used the exact same training and testing protocol as in (Kiros et al., 2015)... (a sketch of the corruption and subsampling steps follows the table) |
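
The two training tricks quoted in the Experiment Setup row, corruption at rate q = 0.9 and subsampling of frequent words, are compact enough to sketch. Below is a minimal Python sketch, not the author's released code (which lives at https://github.com/mchen24/iclr2017); the function names, the t = 1e-5 subsampling threshold, and the toy usage at the end are illustrative assumptions, while the 1/(1-q) rescaling follows the unbiased-dropout formulation described in the paper.

```python
import numpy as np

def corrupt_document(word_ids, q=0.9, rng=None):
    """Corruption step: drop each word of the document independently
    with probability q (the report quotes q = 0.9)."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(word_ids)) >= q      # each word survives with prob. 1 - q
    return [w for w, k in zip(word_ids, keep) if k]

def doc_embedding(word_ids, word_vectors, q=0.9, rng=None):
    """Document vector from a corrupted bag of words: surviving word
    embeddings are rescaled by 1/(1-q), so the corrupted average is an
    unbiased estimate of the clean one, then averaged over the original
    document length."""
    kept = corrupt_document(word_ids, q, rng)
    if not kept:
        return np.zeros(word_vectors.shape[1])
    return word_vectors[kept].sum(axis=0) / ((1.0 - q) * len(word_ids))

def keep_probability(word_freq, t=1e-5):
    """Frequent-word subsampling of Mikolov et al. (2013): keep a word
    with probability sqrt(t / f(w)), capped at 1. The threshold
    t = 1e-5 is an assumed common default, not stated in the report."""
    return min(1.0, np.sqrt(t / word_freq))

# Toy usage: a 10-word vocabulary with 100-d embeddings (the report
# notes the baseline algorithms use 100-dimensional representations).
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 100))
doc = [0, 3, 3, 7, 9]                          # a document as word ids
vec = doc_embedding(doc, V, q=0.9, rng=rng)    # shape (100,)
```

At test time the paper represents a document simply as the average of its learned word embeddings, which corresponds to the q = 0 case of `doc_embedding` above.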