Towards Universal Paraphrastic Sentence Embeddings
Authors: John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare six compositional architectures, evaluating them on annotated textual similarity datasets drawn both from the same distribution as the training data and from a wide range of other domains. We find that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data. However, in out-of-domain scenarios, simple architectures such as word averaging vastly outperform LSTMs. |
| Researcher Affiliation | Academia | John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu. Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA. {jwieting,mbansal,kgimpel,klivescu}@ttic.edu |
| Pseudocode | No | The paper describes mathematical formulations of models but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Trained models and code for training and evaluation are available at http://ttic.uchicago.edu/~wieting. |
| Open Datasets | Yes | Our training data consists of (possibly noisy) pairs taken directly from the original Paraphrase Database (PPDB) and we optimize a margin-based loss. |
| Dataset Splits | Yes | However, for hyperparameter tuning we only used 100k examples sampled from PPDB XXL and trained for 5 epochs. Then after finding the hyperparameters that maximize Spearman’s ρ on the Pavlick et al. PPDB task, we trained on the entire XL section of PPDB for 10 epochs. |
| Hardware Specification | No | The paper states: 'We would also like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) and thank NVIDIA Corporation for donating GPUs used in this research.' However, it does not specify the model or type of GPUs or any other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions using Theano, the optimizers AdaGrad and Adam, and toolkits such as Stanford CoreNLP and NLTK, but does not provide version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Our models have the following tunable hyperparameters: λc, the L2 regularizer on the compositional parameters Wc (not applicable for the word averaging model), the pool of phrases used to obtain negative examples (coupled with mini-batch size B, to reduce the number of tunable hyperparameters), λw, the regularizer on the word embeddings, and δ, the margin. We also tune over optimization method (either AdaGrad (Duchi et al., 2011) or Adam (Kingma & Ba, 2014)), learning rate (from {0.05, 0.005, 0.0005}), whether to clip the gradients with threshold 1 (Pascanu et al., 2012), and whether to use MIX or MAX sampling. (A sketch of how these pieces fit together follows the table.) |
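
The quoted setup describes a margin-based objective over PPDB pairs, with negative examples drawn from the mini-batch (MIX/MAX sampling) and regularization of the word embeddings and compositional parameters. Below is a minimal NumPy sketch of how those pieces combine for the word-averaging model; the function names (`encode`, `max_sample`, `margin_loss`) and the concrete values of `delta` and `lam_w` are illustrative assumptions, not the authors' released implementation (linked above).

```python
# Minimal NumPy sketch of a margin-based paraphrase objective for the
# word-averaging model, following the setup quoted in the table.
# All names and default values here are illustrative assumptions.
import numpy as np

def encode(phrase, embeddings):
    """Word-averaging composition: mean of the word vectors in the phrase.
    Assumes every token of the phrase is present in `embeddings`."""
    vecs = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def max_sample(anchor, pool, embeddings):
    """MAX sampling: pick the phrase in the pool most similar to the anchor."""
    return max(pool, key=lambda p: cosine(encode(anchor, embeddings),
                                          encode(p, embeddings)))

def margin_loss(x1, x2, pool, embeddings, init_embeddings,
                delta=0.4, lam_w=1e-5):
    """Hinge loss on a paraphrase pair (x1, x2) with negatives t1, t2 drawn
    from the mini-batch pool, plus L2 regularization of the word embeddings
    toward their initial values (the lambda_w term).  delta is the margin."""
    g1, g2 = encode(x1, embeddings), encode(x2, embeddings)
    candidates = [p for p in pool if p not in (x1, x2)]
    t1 = max_sample(x1, candidates, embeddings)
    t2 = max_sample(x2, candidates, embeddings)
    hinge = (max(0.0, delta - cosine(g1, g2) + cosine(g1, encode(t1, embeddings)))
             + max(0.0, delta - cosine(g1, g2) + cosine(g2, encode(t2, embeddings))))
    reg = lam_w * sum(np.sum((embeddings[w] - init_embeddings[w]) ** 2)
                      for w in embeddings)
    return hinge + reg
```

Under MAX sampling the negative is the most similar non-paraphrase in the mini-batch; MIX sampling, as tuned over in the paper, would instead alternate between this choice and a random draw from the pool. For the compositional models (but not word averaging), an additional λc term would penalize the L2 norm of the composition parameters Wc.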