A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs
Authors: Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. |
| Researcher Affiliation | Academia | Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi (Princeton University; {arora,mkhodak,nsaunshi}@cs.princeton.edu); Kiran Vodrahalli (Columbia University; kiran.vodrahalli@columbia.edu) |
| Pseudocode | No | No pseudocode or algorithm block explicitly labeled as such was found in the paper. The paper describes methods using mathematical equations and prose. |
| Open Source Code | Yes | Code to reproduce results is provided at https://github.com/NLPrinceton/text_embedding. |
| Open Datasets | Yes | We test classification on MR movie reviews (Pang & Lee, 2005), CR customer reviews (Hu & Liu, 2004), SUBJ subjectivity dataset (Pang & Lee, 2004), MPQA opinion polarity subtask (Wiebe et al., 2005), TREC question classification (Li & Roth, 2002), SST sentiment classification (binary and fine-grained) (Socher et al., 2013), and IMDB movie reviews (Maas et al., 2011). In the main evaluation (Table 1) we use normalized 1600-dimensional GloVe embeddings (Pennington et al., 2014) trained on the Amazon Product Corpus (McAuley et al., 2015), which are released at http://nlp.cs.princeton.edu/DisC. |
| Dataset Splits | Yes | The first four are evaluated using 10-fold cross-validation, while the others have train-test splits. In all cases we use logistic regression with ℓ2-regularization determined by cross-validation. (See the first sketch after the table.) |
| Hardware Specification | Yes | Time needed to initialize model, construct document representations, and train a linear classifier on a 16-core compute node. |
| Software Dependencies | No | The paper mentions techniques like logistic regression and specific embeddings (GloVe, word2vec) but does not provide specific software names with version numbers for replication. |
| Experiment Setup | Yes | In all cases we use logistic regression with ℓ2-regularization determined by cross-validation. 300-dimensional normalized random vectors are used as a baseline. ... a symmetric window of size 10, a min count of 100, for SN/GloVe a cooccurrence cutoff of 1000, and for word2vec a down-sampling frequency cutoff of 10⁻⁵ and a negative example setting of 3. (The sketches after the table illustrate this setup.) |
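
The quoted classification protocol (mean document vectors, an ℓ2-regularized logistic regression whose strength is chosen by cross-validation, and normalized random word vectors as a baseline) is simple enough to sketch. The snippet below is a minimal illustration, not the authors' released code; `word_vecs`, `vocab`, `train_docs`, `test_docs`, `y_train`, and `y_test` are assumed inputs, and the helper `embed_documents` is hypothetical.

```python
# Minimal sketch of the quoted evaluation protocol, assuming pre-tokenized
# documents and a dict `word_vecs` of unit-norm embeddings (e.g. the released
# 1600-dimensional Amazon-corpus GloVe vectors). Not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

def embed_documents(docs, word_vecs, dim):
    """Represent each document as the mean of its in-vocabulary word vectors."""
    reps = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        vecs = [word_vecs[w] for w in doc if w in word_vecs]
        if vecs:
            reps[i] = np.mean(vecs, axis=0)
    return reps

# Baseline from the quoted setup: 300-dimensional normalized random vectors,
# one per vocabulary word (`vocab` is an assumed input).
rng = np.random.default_rng(0)
random_vecs = {}
for w in vocab:
    v = rng.standard_normal(300)
    random_vecs[w] = v / np.linalg.norm(v)

# L2-penalized logistic regression with the regularization strength chosen
# by internal cross-validation, as the quoted setup specifies.
X_train = embed_documents(train_docs, word_vecs, dim=1600)
X_test = embed_documents(test_docs, word_vecs, dim=1600)
clf = LogisticRegressionCV(Cs=10, penalty="l2", cv=5, max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# For MR/CR/SUBJ/MPQA, which lack fixed splits, the quoted protocol reports
# 10-fold cross-validation instead:
# scores = cross_val_score(clf, X_all, y_all, cv=10)
```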
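
The word2vec hyperparameters in the quoted setup also map directly onto gensim keyword arguments. This is a hedged sketch under stated assumptions: the corpus iterator `sentences`, the 300-dimensional output, and the skip-gram variant are illustrative choices not specified in the quote.

```python
# Sketch of training word2vec with the quoted hyperparameters via gensim 4.x.
# `sentences` (an iterable of token lists) and the embedding dimension are
# assumptions; window, min_count, sample, and negative follow the quote.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # illustrative dimension, not stated in the quote
    window=10,         # symmetric window of size 10
    min_count=100,     # min count of 100
    sample=1e-5,       # down-sampling frequency cutoff of 10^-5
    negative=3,        # negative example setting of 3
    sg=1,              # skip-gram; the variant is an assumption
)
model.wv.save_word2vec_format("word2vec_vectors.txt")
```

The SN/GloVe cooccurrence cutoff of 1000 has no gensim analogue; it presumably corresponds to a cap in those methods' own training pipelines (e.g. the x_max weighting parameter in the reference GloVe implementation).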