Towards Universal Paraphrastic Sentence Embeddings

Authors: John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu

ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare six compositional architectures, evaluating them on annotated textual similarity datasets drawn both from the same distribution as the training data and from a wide range of other domains. We find that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data. However, in out-of-domain scenarios, simple architectures such as word averaging vastly outperform LSTMs. (A toy word-averaging sketch appears after this table.)
Researcher Affiliation | Academia | John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu; Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA; {jwieting,mbansal,kgimpel,klivescu}@ttic.edu
Pseudocode | No | The paper describes mathematical formulations of its models but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Trained models and code for training and evaluation are available at http://ttic.uchicago.edu/~wieting.
Open Datasets | Yes | Our training data consists of (possibly noisy) pairs taken directly from the original Paraphrase Database (PPDB) and we optimize a margin-based loss. (A sketch of a hinge loss of this general form follows the table.)
Dataset Splits | Yes | However, for hyperparameter tuning we only used 100k examples sampled from PPDB XXL and trained for 5 epochs. Then after finding the hyperparameters that maximize Spearman’s ρ on the Pavlick et al. PPDB task, we trained on the entire XL section of PPDB for 10 epochs.
Hardware Specification | No | The paper states: 'We would also like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) and thank NVIDIA Corporation for donating GPUs used in this research.' However, it does not specify the model or type of GPUs or any other hardware components used for the experiments.
Software Dependencies | No | The paper mentions using Theano and refers to optimizers such as AdaGrad and Adam and toolkits such as Stanford CoreNLP and NLTK, but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Our models have the following tunable hyperparameters: λc, the L2 regularizer on the compositional parameters Wc (not applicable for the word averaging model), the pool of phrases used to obtain negative examples (coupled with mini-batch size B, to reduce the number of tunable hyperparameters), λw, the regularizer on the word embeddings, and δ, the margin. We also tune over the optimization method (either AdaGrad (Duchi et al., 2011) or Adam (Kingma & Ba, 2014)), the learning rate (from {0.05, 0.005, 0.0005}), whether to clip the gradients with threshold 1 (Pascanu et al., 2012), and whether to use MIX or MAX sampling. (An illustrative reconstruction of this search space follows the table.)
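
The Research Type row contrasts LSTM composition with simple word averaging. As a rough illustration of the word-averaging model only (not the authors' released code or their pretrained vectors), the sketch below averages toy word vectors into sentence embeddings and compares two sentences by cosine similarity; the vocabulary, dimensionality, and vector values are placeholders.

    import numpy as np

    # Toy stand-ins for pretrained word embeddings; the paper's models start from
    # real pretrained vectors, but random 300-d vectors suffice for a sketch.
    rng = np.random.default_rng(0)
    vocab = {w: rng.standard_normal(300)
             for w in "the cat sat on a mat feline rested rug".split()}

    def avg_embedding(sentence):
        """Word-averaging composition: the mean of the word vectors in a sentence."""
        vecs = [vocab[w] for w in sentence.lower().split() if w in vocab]
        return np.mean(vecs, axis=0)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    s1 = avg_embedding("the cat sat on a mat")
    s2 = avg_embedding("a feline rested on a rug")
    print(cosine(s1, s2))  # similarity score for the sentence pair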
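
The Open Datasets row notes that training minimizes a margin-based loss over PPDB phrase pairs. Below is a hedged sketch of a hinge loss of that general shape: it encourages the members of a paraphrase pair to be more similar to each other than to sampled negative examples by at least a margin. The regularization terms and the MAX/MIX negative-sampling procedure described in the paper are omitted, and the default margin here is a placeholder, not a tuned value.

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def margin_loss(x1, x2, t1, t2, delta=0.4):
        """Hinge loss for one paraphrase pair (x1, x2) with negatives t1, t2.

        x1, x2, t1, t2 are sentence embeddings (e.g. from word averaging or an
        LSTM). The pair is pushed to be more similar than each member is to its
        negative example by at least the margin delta (0.4 is a placeholder).
        """
        sim_pos = cosine(x1, x2)
        return (max(0.0, delta - sim_pos + cosine(x1, t1)) +
                max(0.0, delta - sim_pos + cosine(x2, t2)))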
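
The Experiment Setup row reads naturally as a hyperparameter search space. The dictionary below is only an illustrative reconstruction: the optimizer choices, learning rates, gradient-clipping threshold of 1, and MIX/MAX sampling options come from the quoted text, while the entries given as prose strings are left as notes because the excerpt does not list their candidate values.

    # Illustrative reconstruction of the tuned hyperparameters; entries given as
    # strings are described in the excerpt without explicit candidate values.
    search_space = {
        "lambda_c": "L2 on compositional parameters Wc (not used for word averaging)",
        "lambda_w": "L2 regularizer on the word embeddings",
        "delta": "margin in the hinge loss",
        "batch_size_B": "tuned jointly with the pool of negative examples",
        "optimizer": ["AdaGrad", "Adam"],        # Duchi et al. 2011; Kingma & Ba 2014
        "learning_rate": [0.05, 0.005, 0.0005],  # values listed in the paper
        "clip_gradients": [True, False],         # threshold of 1 when enabled
        "negative_sampling": ["MIX", "MAX"],     # how negative examples are chosen
    }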