Learned in Translation: Contextualized Word Vectors

Authors: Bryan McCann, James Bradbury, Caiming Xiong, Richard Socher

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.
Researcher Affiliation | Industry | Bryan McCann (bmccann@salesforce.com), James Bradbury (james.bradbury@salesforce.com), Caiming Xiong (cxiong@salesforce.com), Richard Socher (rsocher@salesforce.com)
Pseudocode | No | The paper describes methods using mathematical equations and diagrams (e.g., Figure 1, Figure 2) but does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The PyTorch code at https://github.com/salesforce/cove includes an example of how to generate CoVe from the MT-LSTM we used in all of our best models. (See the first sketch after this table.)
Open Datasets | Yes | Our smallest MT dataset comes from the WMT 2016 multi-modal translation shared task [Specia et al., 2016]. The training set consists of 30,000 sentence pairs that briefly describe Flickr captions and is often referred to as Multi30k. Our medium-sized MT dataset is the 2016 version of the machine translation task prepared for the International Workshop on Spoken Language Translation [Cettolo et al., 2015]. Our largest MT dataset comes from the news translation shared task from WMT 2017. We train our model separately on two sentiment analysis datasets: the Stanford Sentiment Treebank (SST) [Socher et al., 2013] and the IMDb dataset [Maas et al., 2011]. For question classification, we use the small TREC dataset [Voorhees and Tice, 1999] of open-domain, fact-based questions divided into broad semantic categories. For entailment, we use the Stanford Natural Language Inference Corpus (SNLI) [Bowman et al., 2015]. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] is a large-scale question answering dataset with 87,599 training examples, 10,570 development examples, and a test set that is not released to the public.
Dataset Splits | Yes | IMDb contains 25,000 multi-sentence reviews, which we truncate to the first 200 words. 2,500 reviews are held out for validation. For question classification... We hold out 452 examples for validation and leave 5,000 for training. SNLI, which has 550,152 training, 10,000 validation, and 10,000 testing examples. SQuAD is a large-scale question answering dataset with 87,599 training examples, 10,570 development examples, and a test set that is not released to the public. (See the second sketch after this table.)
Hardware Specification | No | The paper discusses training models and running experiments but does not specify any hardware details such as CPU or GPU models, memory, or cloud instance types used.
Software Dependencies | No | The paper mentions 'PyTorch code' but does not provide specific version numbers for PyTorch or any other software dependencies, libraries, or solvers used in the experiments.
Experiment Setup | Yes | When training an MT-LSTM, we used fixed 300-dimensional word vectors. The hidden size of the LSTMs in all MT-LSTMs is 300. The model was trained with stochastic gradient descent with a learning rate that began at 1 and decayed by half each epoch after the validation perplexity increased for the first time. Dropout with ratio 0.2 was applied to the inputs and outputs of all layers of the encoder and decoder. Models were trained using Adam with α = 0.001. Dropout was applied before all feedforward layers with dropout ratio 0.1, 0.2, or 0.3. Maxout networks pool over 4 channels, reduce dimensionality by 2, 4, or 8, reduce again by 2, and project to the output dimension. (See the third sketch after this table.)
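
The released repository includes an example of generating CoVe from the trained MT-LSTM. As a rough illustration of what that means, the sketch below follows the paper's recipe, CoVe(w) = MT-LSTM(GloVe(w)): 300-dimensional GloVe vectors pass through a two-layer bidirectional LSTM encoder, and downstream models concatenate the result with the original GloVe vectors. Class and variable names here (MTLSTMEncoder, glove) are illustrative placeholders, not the repository's actual API.

```python
import torch
import torch.nn as nn

class MTLSTMEncoder(nn.Module):
    """Illustrative two-layer bidirectional LSTM in the spirit of the MT-LSTM.

    The encoder consumes pretrained 300-d word vectors and returns
    600-d (2 x 300) contextualized vectors per token.
    """
    def __init__(self, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, glove_embeddings):
        # glove_embeddings: (batch, seq_len, 300) pretrained word vectors
        cove_vectors, _ = self.lstm(glove_embeddings)
        return cove_vectors  # (batch, seq_len, 600)

# Downstream models concatenate GloVe and CoVe per token:
encoder = MTLSTMEncoder()
glove = torch.randn(2, 10, 300)            # stand-in for GloVe lookups
cove = encoder(glove)
inputs = torch.cat([glove, cove], dim=-1)  # (2, 10, 900)
```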
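As a small illustration of the IMDb holdout reported in the Dataset Splits row (2,500 of the 25,000 training reviews held out for validation), a random split could look as follows. The dataset object and seed are placeholders, not the paper's preprocessing pipeline.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder standing in for the 25,000 truncated IMDb training reviews.
imdb_train = TensorDataset(torch.arange(25_000))

# Hold out 2,500 reviews for validation, matching the split reported above.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(imdb_train, [22_500, 2_500], generator=generator)
print(len(train_set), len(val_set))  # 22500 2500
```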
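Finally, a minimal sketch of the two optimization regimes described in the Experiment Setup row, assuming standard PyTorch components: SGD with an initial learning rate of 1 that is halved each epoch once validation perplexity first increases (MT-LSTM pretraining), and Adam with α = 0.001 plus dropout before feedforward layers (downstream task models). The model, evaluation, and layer sizes below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.LSTM(300, 300, num_layers=2, bidirectional=True)  # stand-in MT-LSTM

# MT-LSTM pretraining: SGD, lr = 1.0, halved every epoch after validation
# perplexity first rises (the trigger is sketched with a flag here).
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)
perplexity_has_increased = False
best_val_ppl = float("inf")

for epoch in range(10):
    # ... train one epoch, then evaluate perplexity on the validation set ...
    val_ppl = best_val_ppl  # placeholder; replace with real evaluation
    if val_ppl > best_val_ppl:
        perplexity_has_increased = True
    best_val_ppl = min(best_val_ppl, val_ppl)
    if perplexity_has_increased:
        scheduler.step()  # halve the learning rate each subsequent epoch

# Downstream task models: Adam with alpha = 0.001 and dropout before
# feedforward layers (ratio 0.1, 0.2, or 0.3 in the paper; 0.2 shown here).
task_head = nn.Sequential(nn.Dropout(0.2), nn.Linear(900, 5))
task_optimizer = torch.optim.Adam(task_head.parameters(), lr=0.001)
```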