A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs

Authors: Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods.
Researcher Affiliation | Academia | Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi (Princeton University, {arora,mkhodak,nsaunshi}@cs.princeton.edu); Kiran Vodrahalli (Columbia University, kiran.vodrahalli@columbia.edu)
Pseudocode | No | No pseudocode or algorithm block explicitly labeled as such was found in the paper. The paper describes methods using mathematical equations and prose.
Open Source Code | Yes | Code to reproduce results is provided at https://github.com/NLPrinceton/text_embedding.
Open Datasets | Yes | We test classification on MR movie reviews (Pang & Lee, 2005), CR customer reviews (Hu & Liu, 2004), SUBJ subjectivity dataset (Pang & Lee, 2004), MPQA opinion polarity subtask (Wiebe et al., 2005), TREC question classification (Li & Roth, 2002), SST sentiment classification (binary and fine-grained) (Socher et al., 2013), and IMDB movie reviews (Maas et al., 2011). In the main evaluation (Table 1) we use normalized 1600-dimensional GloVe embeddings (Pennington et al., 2014) trained on the Amazon Product Corpus (McAuley et al., 2015), which are released at http://nlp.cs.princeton.edu/DisC.
Dataset Splits | Yes | The first four are evaluated using 10-fold cross-validation, while the others have train-test splits. In all cases we use logistic regression with ℓ2-regularization determined by cross-validation. (A minimal sketch of this classification protocol follows the table.)
Hardware Specification | Yes | Time needed to initialize model, construct document representations, and train a linear classifier on a 16-core compute node.
Software Dependencies | No | The paper mentions techniques like logistic regression and specific embeddings (GloVe, word2vec) but does not provide specific software names with version numbers for replication.
Experiment Setup | Yes | In all cases we use logistic regression with ℓ2-regularization determined by cross-validation. 300-dimensional normalized random vectors are used as a baseline. ... a symmetric window of size 10, a min count of 100, for SN/GloVe a cooccurrence cutoff of 1000, and for word2vec a down-sampling frequency cutoff of 10^-5 and a negative example setting of 3. (A minimal word2vec configuration sketch follows the table.)
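
The Dataset Splits and Experiment Setup rows describe the evaluation protocol: ℓ2-regularized logistic regression with the regularization strength chosen by cross-validation, and 10-fold cross-validation for the datasets without fixed train-test splits. The sketch below is a minimal illustration of that protocol, not the authors' code from the text_embedding repository; the placeholder embeddings, labels, and scikit-learn hyperparameter grid are assumptions.

```python
# Minimal sketch (not the authors' implementation): l2-regularized logistic
# regression with the penalty strength selected by cross-validation, evaluated
# with 10-fold cross-validation as in the quoted protocol.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Placeholder document embeddings and binary labels; in the paper these would
# be fixed document representations (e.g. built from 1600-dim GloVe vectors).
X = np.random.randn(1000, 1600)
y = np.random.randint(0, 2, size=1000)

clf = LogisticRegressionCV(
    Cs=10,              # grid of inverse regularization strengths (assumed grid size)
    cv=10,              # inner folds used to pick the l2 penalty
    penalty="l2",
    solver="liblinear",
    max_iter=1000,
)

# For MR, CR, SUBJ, and MPQA (no fixed split), report 10-fold cross-validation
# accuracy; for the datasets with train-test splits, fit on train and score on test.
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())
```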
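The word2vec hyperparameters quoted in the Experiment Setup row (symmetric window of 10, min count of 100, down-sampling cutoff of 10^-5, 3 negative examples) map directly onto a standard training configuration. The sketch below assumes gensim 4.x and a pre-tokenized corpus file; the corpus path, vector dimensionality, and worker count are hypothetical, since the excerpt does not state them for word2vec.

```python
# Minimal sketch (assumes gensim 4.x; not the authors' training pipeline):
# word2vec training with the hyperparameters quoted above.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("amazon_reviews_tokenized.txt")  # hypothetical corpus path

model = Word2Vec(
    sentences=sentences,
    vector_size=1600,  # assumption; the quoted GloVe vectors are 1600-dimensional
    window=10,         # symmetric window of size 10
    min_count=100,     # min count of 100
    sample=1e-5,       # down-sampling frequency cutoff of 10^-5
    negative=3,        # negative example setting of 3
    workers=16,        # assumption, matching the 16-core node noted under Hardware
)

model.wv.save_word2vec_format("word2vec_amazon.txt")
```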