Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning
Authors: Sandeep Subramanian, Adam Trischler, Yoshua Bengio, Christopher J Pal
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. |
| Researcher Affiliation | Collaboration | Sandeep Subramanian (1,2,3), Adam Trischler (3), Yoshua Bengio (1,2,4) & Christopher J Pal (1,5); 1: Montréal Institute for Learning Algorithms (MILA), 2: Université de Montréal, 3: Microsoft Research Montreal, 4: CIFAR Senior Fellow, 5: École Polytechnique de Montréal |
| Pseudocode | Yes | Our approach is described formally in the Algorithm below (a hedged Python sketch of this loop is given after the table). Require: a set of k tasks with a common source language, a shared encoder E across all tasks, and a set of k task-specific decoders D_1 ... D_k. Let θ denote each model's parameters, α a probability vector (p_1 ... p_k) denoting the probability of sampling a task such that Σ_i p_i = 1, datasets for each task P_1 ... P_k, and a loss function L. While θ has not converged: 1) Sample task i ∼ Cat(k, α). 2) Sample input-output pairs (x, y) ∼ P_i. 3) Compute the input representation h_x ← E_θ(x). 4) Compute the prediction ŷ ← D_i,θ(h_x). 5) Update θ ← Adam(∇_θ L(ŷ, y)). (Figure 1) |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. It only links to third-party resources or benchmarks. |
| Open Datasets | Yes | We use the Book Corpus (Zhu et al., 2015). We use a parallel corpus of around 4.5 million English-German (De) sentence pairs from WMT15 and 40 million English-French (Fr) sentence pairs from WMT14. We train on 3 million weakly labeled parses obtained by parsing a random subset of the 1-billion word corpus with the Puck GPU parser along with gold parses from sections 0-21 of the WSJ section of Penn Treebank. We train on a collection of about 1 million sentence pairs from the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017) corpora. We use 113k training images from MSCOCO with 5k images for validation and 5k for testing. We also evaluate on the recently published Quora duplicate question dataset since it is an order of magnitude larger than the others (approximately 400,000 question pairs). |
| Dataset Splits | Yes | We use 113k training images from MSCOCO with 5k images for validation and 5k for testing. In addition to performing 10-fold cross-validation to determine the L2 regularization penalty on the logistic regression models, we also tune the way in which our sentence representations are generated from the hidden states corresponding to words in a sentence. (A sketch of this probing protocol is given after the table.) |
| Hardware Specification | Yes | Models were trained for 7 days on an Nvidia Tesla P100-SXM2-16GB GPU. We thank NVIDIA for donating a DGX-1 computer used in this work |
| Software Dependencies | No | The paper credits the PyTorch development team (Paszke et al., 2017) and uses the Adam optimizer (Kingma & Ba, 2014), but it does not provide version numbers for PyTorch or any other software dependency. |
| Experiment Setup | Yes | All models use word embeddings of 512 dimensions and GRUs with either 1500 or 2048 hidden units. We used minibatches of 48 examples and the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 0.002. Models were trained for 7 days on an Nvidia Tesla P100-SXM2-16GB GPU. For natural language inference, ... a single-layer MLP with a dropout (Srivastava et al., 2014) rate of 0.3. (A hedged configuration sketch follows the table.) |
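
The multi-task training loop reconstructed in the Pseudocode row above maps directly onto a short script. The following is a minimal sketch, assuming placeholder `encoder`, `decoders`, `iterators`, and `losses` objects and a hypothetical `train_multitask` function; it illustrates the sampling-and-update loop, not the authors' released code.

```python
# Minimal sketch of the multi-task training loop (Figure 1 of the paper).
# encoder, decoders, iterators, losses, and alpha are placeholder assumptions.
import numpy as np
import torch


def train_multitask(encoder, decoders, iterators, losses, alpha,
                    steps=100_000, lr=2e-3):
    """encoder  : shared sentence encoder E_theta
    decoders : list of k task-specific decoders D_1 .. D_k
    iterators: list of k iterators yielding (x, y) minibatches, one per task
    losses   : list of k per-task loss functions
    alpha    : probability vector (p_1 .. p_k) over tasks, summing to 1
    """
    params = list(encoder.parameters())
    for dec in decoders:
        params += list(dec.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)  # paper reports Adam with lr = 0.002

    for _ in range(steps):
        i = np.random.choice(len(decoders), p=alpha)  # 1) sample task i ~ Cat(k, alpha)
        x, y = next(iterators[i])                     # 2) sample a minibatch from task i
        h_x = encoder(x)                              # 3) shared representation h_x = E_theta(x)
        y_hat = decoders[i](h_x)                      # 4) task-specific prediction
        loss = losses[i](y_hat, y)                    # 5) task loss L(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              #    Adam update of theta
```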
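
The Dataset Splits row also describes the downstream evaluation protocol: logistic regression probes on frozen sentence vectors, with the L2 penalty chosen by 10-fold cross-validation. A minimal sketch using scikit-learn is shown below; the `encode` function and the grid of penalty values are assumptions for illustration.

```python
# Sketch of the transfer-evaluation protocol: a logistic regression probe on
# frozen sentence vectors, with the L2 penalty chosen by 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV


def evaluate_transfer(encode, train_sents, train_labels, test_sents, test_labels):
    # encode() is a placeholder that maps a sentence to a fixed-size vector
    X_train = np.stack([encode(s) for s in train_sents])
    X_test = np.stack([encode(s) for s in test_sents])
    # 10-fold CV over the inverse L2 penalty strength C (grid size is an assumption)
    clf = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000)
    clf.fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```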
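
Finally, the Experiment Setup row reports the main hyperparameters: 512-dimensional word embeddings, GRU encoders with 1500 or 2048 hidden units, Adam with a learning rate of 0.002, minibatches of 48, and a single-layer MLP with dropout 0.3 for natural language inference. The sketch below wires those reported numbers together; the module names, the max-pooling over hidden states, the classifier's hidden width, and the [u, v, |u - v|, u * v] feature combination are assumptions, not a verified reproduction.

```python
# Sketch of the reported configuration; structural details marked below are assumptions.
import torch
import torch.nn as nn


class GRUSentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)         # 512-dim word embeddings
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)   # 1500 or 2048 hidden units

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> sentence vector: (batch, hidden_dim)
        states, _ = self.gru(self.embedding(token_ids))
        return states.max(dim=1).values  # max-pool over time (one possible pooling choice)


class NLIClassifier(nn.Module):
    def __init__(self, hidden_dim=2048, n_classes=3, dropout=0.3):
        super().__init__()
        self.mlp = nn.Sequential(            # single hidden layer with dropout 0.3
            nn.Dropout(dropout),
            nn.Linear(4 * hidden_dim, 512),  # hidden width 512 is an assumption
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, u, v):
        # premise vector u, hypothesis vector v; feature combination is an assumption
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(feats)


# Optimizer settings quoted in the paper: Adam, lr = 0.002, minibatches of 48 examples.
encoder = GRUSentenceEncoder(vocab_size=80_000)  # vocabulary size is an assumption
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.002)
```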