Learned in Translation: Contextualized Word Vectors

Authors: Bryan McCann, James Bradbury, Caiming Xiong, Richard Socher

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.
Researcher Affiliation | Industry | Bryan McCann (bmccann@salesforce.com), James Bradbury (james.bradbury@salesforce.com), Caiming Xiong (cxiong@salesforce.com), Richard Socher (rsocher@salesforce.com)
Pseudocode | No | The paper describes methods using mathematical equations and diagrams (e.g., Figure 1, Figure 2) but does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The PyTorch code at https://github.com/salesforce/cove includes an example of how to generate CoVe from the MT-LSTM we used in all of our best models. (See the first sketch after this table.)
Open Datasets | Yes | Our smallest MT dataset comes from the WMT 2016 multi-modal translation shared task [Specia et al., 2016]. The training set consists of 30,000 sentence pairs that briefly describe Flickr captions and is often referred to as Multi30k. Our medium-sized MT dataset is the 2016 version of the machine translation task prepared for the International Workshop on Spoken Language Translation [Cettolo et al., 2015]. Our largest MT dataset comes from the news translation shared task from WMT 2017. We train our model separately on two sentiment analysis datasets: the Stanford Sentiment Treebank (SST) [Socher et al., 2013] and the IMDb dataset [Maas et al., 2011]. For question classification, we use the small TREC dataset [Voorhees and Tice, 1999] of open-domain, fact-based questions divided into broad semantic categories. For entailment, we use the Stanford Natural Language Inference Corpus (SNLI) [Bowman et al., 2015]. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] is a large-scale question answering dataset with 87,599 training examples, 10,570 development examples, and a test set that is not released to the public.
Dataset Splits | Yes | IMDb contains 25,000 multi-sentence reviews, which we truncate to the first 200 words. 2,500 reviews are held out for validation. For question classification... We hold out 452 examples for validation and leave 5,000 for training. SNLI, which has 550,152 training, 10,000 validation, and 10,000 testing examples. SQuAD is a large-scale question answering dataset with 87,599 training examples, 10,570 development examples, and a test set that is not released to the public. (See the second sketch after this table.)
Hardware Specification | No | The paper discusses training models and running experiments but does not specify any hardware details such as CPU or GPU models, memory, or cloud instance types used.
Software Dependencies | No | The paper mentions 'PyTorch code' but does not provide specific version numbers for PyTorch or any other software dependencies, libraries, or solvers used in the experiments.
Experiment Setup | Yes | When training an MT-LSTM, we used fixed 300-dimensional word vectors. The hidden size of the LSTMs in all MT-LSTMs is 300. The model was trained with stochastic gradient descent with a learning rate that began at 1 and decayed by half each epoch after the validation perplexity increased for the first time. Dropout with ratio 0.2 was applied to the inputs and outputs of all layers of the encoder and decoder. Models were trained using Adam with α = 0.001. Dropout was applied before all feedforward layers with dropout ratio 0.1, 0.2, or 0.3. Maxout networks pool over 4 channels, reduce dimensionality by 2, 4, or 8, reduce again by 2, and project to the output dimension. (See the third sketch after this table.)
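
The released repository includes an example of generating CoVe from the trained MT-LSTM. As a rough illustration of what that means, the sketch below follows the paper's recipe, CoVe(w) = MT-LSTM(GloVe(w)): 300-dimensional GloVe vectors pass through a two-layer bidirectional LSTM encoder, and downstream models concatenate the result with the original GloVe vectors. Class and variable names here (MTLSTMEncoder, glove) are illustrative placeholders, not the repository's actual API.

```python
import torch
import torch.nn as nn

class MTLSTMEncoder(nn.Module):
    """Illustrative two-layer bidirectional LSTM in the spirit of the MT-LSTM.

    The encoder consumes pretrained 300-d word vectors and returns
    600-d (2 x 300) contextualized vectors per token.
    """
    def __init__(self, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, glove_embeddings):
        # glove_embeddings: (batch, seq_len, 300) pretrained word vectors
        cove_vectors, _ = self.lstm(glove_embeddings)
        return cove_vectors  # (batch, seq_len, 600)

# Downstream models concatenate GloVe and CoVe per token:
encoder = MTLSTMEncoder()
glove = torch.randn(2, 10, 300)            # stand-in for GloVe lookups
cove = encoder(glove)
inputs = torch.cat([glove, cove], dim=-1)  # (2, 10, 900)
```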
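As a small illustration of the IMDb holdout reported in the Dataset Splits row (2,500 of the 25,000 training reviews held out for validation), a random split could look as follows. The dataset object and seed are placeholders, not the paper's preprocessing pipeline.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder standing in for the 25,000 truncated IMDb training reviews.
imdb_train = TensorDataset(torch.arange(25_000))

# Hold out 2,500 reviews for validation, matching the split reported above.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(imdb_train, [22_500, 2_500], generator=generator)
print(len(train_set), len(val_set))  # 22500 2500
```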
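Finally, a minimal sketch of the two optimization regimes described in the Experiment Setup row, assuming standard PyTorch components: SGD with an initial learning rate of 1 that is halved each epoch once validation perplexity first increases (MT-LSTM pretraining), and Adam with α = 0.001 plus dropout before feedforward layers (downstream task models). The model, evaluation, and layer sizes below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.LSTM(300, 300, num_layers=2, bidirectional=True)  # stand-in MT-LSTM

# MT-LSTM pretraining: SGD, lr = 1.0, halved every epoch after validation
# perplexity first rises (the trigger is sketched with a flag here).
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)
perplexity_has_increased = False
best_val_ppl = float("inf")

for epoch in range(10):
    # ... train one epoch, then evaluate perplexity on the validation set ...
    val_ppl = best_val_ppl  # placeholder; replace with real evaluation
    if val_ppl > best_val_ppl:
        perplexity_has_increased = True
    best_val_ppl = min(best_val_ppl, val_ppl)
    if perplexity_has_increased:
        scheduler.step()  # halve the learning rate each subsequent epoch

# Downstream task models: Adam with alpha = 0.001 and dropout before
# feedforward layers (ratio 0.1, 0.2, or 0.3 in the paper; 0.2 shown here).
task_head = nn.Sequential(nn.Dropout(0.2), nn.Linear(900, 5))
task_optimizer = torch.optim.Adam(task_head.parameters(), lr=0.001)
```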