A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Authors: Sanjeev Arora, Yingyu Liang, Tengyu Ma
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The unsupervised method GloVe+WR improves upon avg-GloVe significantly by 10% to 30%, and beats the baselines by large margins. It achieves better performance than LSTM and RNN and is comparable to DAN, even though the latter three use supervision. This demonstrates the power of this simple method: it can be even stronger than highly-tuned, supervisedly trained sophisticated models. Using the TF-IDF weighting scheme also improves over the unweighted average, but not as much as our method. The semi-supervised method PSL+WR achieves the best results for four out of the six tasks and is comparable to the best in the remaining two tasks. Overall, it outperforms the avg-PSL baseline and all the supervised models initialized with the same PSL vectors. This demonstrates the advantage of our method over the training for those models. |
| Researcher Affiliation | Academia | Sanjeev Arora, Yingyu Liang, Tengyu Ma; Princeton University; {arora,yingyul,tengyu}@cs.princeton.edu |
| Pseudocode | Yes | Algorithm 1: Sentence Embedding (a hedged code sketch of this weighted-average-plus-component-removal procedure is given after the table) |
| Open Source Code | Yes | The code is available on https://github.com/PrincetonML/SIF |
| Open Datasets | Yes | Datasets. We test our methods on the 22 textual similarity datasets including all the datasets from SemEval semantic textual similarity (STS) tasks (2012-2015) (Agirre et al., 2012; 2013; 2014; Agirrea et al., 2015), and the SemEval 2015 Twitter task (Xu et al., 2015) and the SemEval 2014 Semantic Relatedness task (Marelli et al., 2014). The objective of these tasks is to predict the similarity between two given sentences. The evaluation criterion is the Pearson's coefficient between the predicted scores and the ground-truth scores (a sketch of this evaluation is given after the table). ... We consider three tasks: the SICK similarity task, the SICK entailment task, and the Stanford Sentiment Treebank (SST) binary classification task (Socher et al., 2013). ... The first experiment is for the 3-class classification task on the SNLI dataset (Bowman et al., 2015). ... The second experiment is the sentiment analysis task on the IMDB dataset, studied in (Wang & Manning, 2012). |
| Dataset Splits | Yes | For hyperparameter tuning they used 100k examples sampled from PPDB XXL and trained for 5 epochs. Then after finding the hyperparameters that maximize Spearman's coefficients on the Pavlick et al. PPDB task, they are trained on the entire XL section of PPDB for 10 epochs. ... The other hyperparameters are enumerated as in (Wieting et al., 2016), and the same validation approach is used to select the final values. |
| Hardware Specification | No | No specific hardware (CPU, GPU models, memory) used for running experiments is mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | The weighting parameter a is fixed to 10^-3, and the word frequencies p(w) are estimated from the commoncrawl dataset. ... We enumerate a ∈ {10^-i, 3·10^-i : 1 ≤ i ≤ 5} and use the p(w) estimated on the enwiki dataset. ... For the weighted average, the hyperparameter a is enumerated in {10^-i, 3·10^-i : 2 ≤ i ≤ 3}. The other hyperparameters are enumerated as in (Wieting et al., 2016), and the same validation approach is used to select the final values. |
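
The Pseudocode row above refers to the paper's Algorithm 1: a frequency-weighted average of word vectors followed by removal of the sentence matrix's first principal component. Below is a minimal sketch of that scheme, assuming pre-loaded word vectors and unigram probabilities; the function and variable names (`sif_embeddings`, `word_vecs`, `word_probs`) are illustrative and do not come from the released PrincetonML/SIF code.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """Sketch of the SIF baseline: `sentences` is a list of token lists,
    `word_vecs` maps a word to its vector, `word_probs` maps a word to its
    estimated unigram frequency p(w)."""
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        tokens = [w for w in sent if w in word_vecs]
        if not tokens:
            continue
        # Down-weight frequent words: weight(w) = a / (a + p(w)).
        weights = np.array([a / (a + word_probs.get(w, 0.0)) for w in tokens])
        vectors = np.array([word_vecs[w] for w in tokens])
        emb[i] = weights @ vectors / len(tokens)
    # Common component removal: project out the first right singular vector.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)
```

The default `a=1e-3` mirrors the fixed value reported in the Experiment Setup row; in the paper, p(w) is estimated from large corpora such as commoncrawl or enwiki.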
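The Open Datasets row notes that the STS tasks are scored by Pearson's coefficient between predicted and ground-truth similarities. The sketch below shows that evaluation under the common convention of using cosine similarity between paired sentence embeddings as the predicted score; `sts_pearson` and its arguments are hypothetical names, not part of the paper's evaluation scripts.

```python
import numpy as np
from scipy.stats import pearsonr

def sts_pearson(emb_a, emb_b, gold_scores):
    """emb_a, emb_b: (n, d) embeddings for the two sentences of each pair;
    gold_scores: length-n array of human similarity ratings."""
    # Predicted similarity: cosine between paired embeddings (epsilon avoids division by zero).
    num = np.sum(emb_a * emb_b, axis=1)
    denom = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1) + 1e-12
    pred = num / denom
    # Pearson correlation between predicted scores and the ground-truth scores.
    return pearsonr(pred, gold_scores)[0]
```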