Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Vector Representation for Documents through Corruption
Authors: Minmin Chen
ICLR 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Doc2VecC on a sentiment analysis task, a document classification task and a semantic relatedness task, along with several document representation learning algorithms. |
| Researcher Affiliation | Industry | Minmin Chen Criteo Research Palo Alto, CA 94301, USA EMAIL |
| Pseudocode | No | The paper provides mathematical derivations and descriptions but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All experiments can be reproduced using the code available at https://github.com/mchen24/iclr2017 |
| Open Datasets | Yes | For sentiment analysis, we use the IMDB movie review dataset. It comes with predefined train/test split (Maas et al., 2011)... We test Doc2VecC on the SemEval 2014 Task 1: semantic relatedness SICK dataset (Marelli et al., 2014). |
| Dataset Splits | Yes | The hyper-parameters are tuned on a validation set subsampled from the training set. ... The set is splitted into a training set of 4,500 instances, a validation set of 500, and a test set of 4,927. |
| Hardware Specification | Yes | The experiments were conducted on a desktop with Intel i7 2.2Ghz cpu. |
| Software Dependencies | No | The paper mentions using a 'linear support vector machine (SVM)' and 't-SNE' for analysis but does not provide specific version numbers for any software libraries or tools. |
| Experiment Setup | Yes | We remove words that appear less than 10 times in the training set... A vector of 4800 dimensions... are generated for each document. In comparison, all the other algorithms produce a vector representation of size 100. ...we used q = 0.9 throughout the experiments. ... We used a cutoff of 100 in this experiment. ...we applied the trick of subsampling of frequent words introduced in (Mikolov & Dean, 2013)... Given the sentence embeddings, we used the exact same training and testing protocol as in (Kiros et al., 2015)... |
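
Two of the quoted setup choices, the minimum word count of 10 and the value q = 0.9, can be illustrated with a minimal Python sketch. It is not taken from the paper's repository. It assumes that q is the probability of dropping a word during corruption and that surviving counts are rescaled by 1/(1 - q); the function names `build_vocab` and `corrupt_counts` are hypothetical.

```python
import random
from collections import Counter


def build_vocab(docs, min_count=10):
    """Return the set of words occurring at least `min_count` times in `docs`.

    `docs` is a list of tokenized documents (lists of strings).
    """
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


def corrupt_counts(doc, vocab, q=0.9, rng=None):
    """Return a corrupted bag-of-words for one document.

    Assumption: each in-vocabulary token is dropped with probability q,
    and each surviving token contributes 1 / (1 - q) instead of 1, so the
    expected corrupted count equals the original count.
    """
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - q)
    corrupted = Counter()
    for word in doc:
        if word in vocab and rng.random() >= q:
            corrupted[word] += scale
    return dict(corrupted)


if __name__ == "__main__":
    # Toy corpus: "good" appears 10 times, "rare" only once.
    docs = [["good"] * 10 + ["rare"]]
    vocab = build_vocab(docs, min_count=10)
    print(sorted(vocab))
    print(corrupt_counts(docs[0], vocab, q=0.9, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a corrupted sample repeatable, which helps when comparing runs.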