A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Authors: Sanjeev Arora, Yingyu Liang, Tengyu Ma
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The unsupervised method GloVe+WR improves upon avg-GloVe significantly by 10% to 30%, and beats the baselines by large margins. It achieves better performance than LSTM and RNN and is comparable to DAN, even though the latter three use supervision. This demonstrates the power of this simple method: it can be even stronger than highly-tuned, supervisedly trained sophisticated models. Using the TF-IDF weighting scheme also improves over the unweighted average, but not as much as our method. The semi-supervised method PSL+WR achieves the best results for four out of the six tasks and is comparable to the best in the remaining two tasks. Overall, it outperforms the avg-PSL baseline and all the supervised models initialized with the same PSL vectors. This demonstrates the advantage of our method over the training for those models. |
| Researcher Affiliation | Academia | Sanjeev Arora, Yingyu Liang, Tengyu Ma; Princeton University; {arora,yingyul,tengyu}@cs.princeton.edu |
| Pseudocode | Yes | Algorithm 1: Sentence Embedding (a hedged code sketch of this weighted-average-plus-component-removal procedure is given after the table) |
| Open Source Code | Yes | The code is available on https://github.com/PrincetonML/SIF |
| Open Datasets | Yes | Datasets. We test our methods on the 22 textual similarity datasets including all the datasets from SemEval semantic textual similarity (STS) tasks (2012-2015) (Agirre et al., 2012; 2013; 2014; Agirrea et al., 2015), and the SemEval 2015 Twitter task (Xu et al., 2015) and the SemEval 2014 Semantic Relatedness task (Marelli et al., 2014). The objective of these tasks is to predict the similarity between two given sentences. The evaluation criterion is the Pearson's coefficient between the predicted scores and the ground-truth scores (a sketch of this evaluation is given after the table). ... We consider three tasks: the SICK similarity task, the SICK entailment task, and the Stanford Sentiment Treebank (SST) binary classification task (Socher et al., 2013). ... The first experiment is for the 3-class classification task on the SNLI dataset (Bowman et al., 2015). ... The second experiment is the sentiment analysis task on the IMDB dataset, studied in (Wang & Manning, 2012). |
| Dataset Splits | Yes | For hyperparameter tuning they used 100k examples sampled from PPDB XXL and trained for 5 epochs. Then after finding the hyperparameters that maximize Spearman's coefficients on the Pavlick et al. PPDB task, they are trained on the entire XL section of PPDB for 10 epochs. ... The other hyperparameters are enumerated as in (Wieting et al., 2016), and the same validation approach is used to select the final values. |
| Hardware Specification | No | No specific hardware (CPU, GPU models, memory) used for running experiments is mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | The weighting parameter a is fixed to 10^-3, and the word frequencies p(w) are estimated from the commoncrawl dataset. ... We enumerate a ∈ {10^-i, 3·10^-i : 1 ≤ i ≤ 5} and use the p(w) estimated on the enwiki dataset. ... For the weighted average, the hyperparameter a is enumerated in {10^-i, 3·10^-i : 2 ≤ i ≤ 3}. The other hyperparameters are enumerated as in (Wieting et al., 2016), and the same validation approach is used to select the final values. |
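
The Pseudocode row above refers to the paper's Algorithm 1: a frequency-weighted average of word vectors followed by removal of the sentence matrix's first principal component. Below is a minimal sketch of that scheme, assuming pre-loaded word vectors and unigram probabilities; the function and variable names (`sif_embeddings`, `word_vecs`, `word_probs`) are illustrative and do not come from the released PrincetonML/SIF code.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """Sketch of the SIF baseline: `sentences` is a list of token lists,
    `word_vecs` maps a word to its vector, `word_probs` maps a word to its
    estimated unigram frequency p(w)."""
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        tokens = [w for w in sent if w in word_vecs]
        if not tokens:
            continue
        # Down-weight frequent words: weight(w) = a / (a + p(w)).
        weights = np.array([a / (a + word_probs.get(w, 0.0)) for w in tokens])
        vectors = np.array([word_vecs[w] for w in tokens])
        emb[i] = weights @ vectors / len(tokens)
    # Common component removal: project out the first right singular vector.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)
```

The default `a=1e-3` mirrors the fixed value reported in the Experiment Setup row; in the paper, p(w) is estimated from large corpora such as commoncrawl or enwiki.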
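The Open Datasets row notes that the STS tasks are scored by Pearson's coefficient between predicted and ground-truth similarities. The sketch below shows that evaluation under the common convention of using cosine similarity between paired sentence embeddings as the predicted score; `sts_pearson` and its arguments are hypothetical names, not part of the paper's evaluation scripts.

```python
import numpy as np
from scipy.stats import pearsonr

def sts_pearson(emb_a, emb_b, gold_scores):
    """emb_a, emb_b: (n, d) embeddings for the two sentences of each pair;
    gold_scores: length-n array of human similarity ratings."""
    # Predicted similarity: cosine between paired embeddings (epsilon avoids division by zero).
    num = np.sum(emb_a * emb_b, axis=1)
    denom = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1) + 1e-12
    pred = num / denom
    # Pearson correlation between predicted scores and the ground-truth scores.
    return pearsonr(pred, gold_scores)[0]
```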