Sequence to Sequence Learning with Neural Networks

Authors: Ilya Sutskever, Oriol Vinyals, Quoc V. Le

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. (A BLEU-scoring sketch follows the table.)
Researcher Affiliation | Industry | Ilya Sutskever (Google, ilyasu@google.com); Oriol Vinyals (Google, vinyals@google.com); Quoc V. Le (Google, qvl@google.com)
Pseudocode | No | The paper describes the model and processes via text and equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about the release of its source code, nor does it include a link to a code repository.
Open Datasets | Yes | We used the WMT'14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean selected subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT [29].
Dataset Splits | No | The paper mentions a 'training and test set' but does not specify a separate validation set, nor does it give explicit split sizes, percentages, or a splitting methodology.
Hardware Specification | No | The paper mentions running experiments 'on a single GPU' and using 'an 8-GPU machine', but does not provide specific details such as GPU model numbers, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions a 'C++ implementation' but does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or compiler versions).
Experiment Setup | Yes | We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. We initialized all of the LSTM's parameters with the uniform distribution between -0.08 and 0.08. We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs. We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128). (A hedged code sketch of this configuration follows the table.)
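
The hyperparameters quoted in the Experiment Setup row map onto a small model-and-optimizer configuration. The following is a minimal, hypothetical PyTorch sketch of that configuration; the original work used a custom C++ implementation, so the framework choice, the module names (Seq2Seq, lr_at), and the simplified decoder wiring are assumptions for illustration, not the authors' code. It also omits details the paper relies on, such as reversing the source sentences and spreading LSTM layers across GPUs.

    import torch
    import torch.nn as nn

    # Sizes quoted in the "Experiment Setup" row above.
    SRC_VOCAB = 160_000   # input (English) vocabulary
    TGT_VOCAB = 80_000    # output (French) vocabulary
    EMB_DIM   = 1000      # word-embedding dimensionality
    HIDDEN    = 1000      # LSTM cells per layer
    LAYERS    = 4         # depth of the encoder and decoder LSTMs

    class Seq2Seq(nn.Module):
        """Encoder-decoder LSTM: the encoder's final (h, c) state initializes the decoder."""
        def __init__(self):
            super().__init__()
            self.src_emb = nn.Embedding(SRC_VOCAB, EMB_DIM)
            self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB_DIM)
            self.encoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS)
            self.decoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS)
            self.proj    = nn.Linear(HIDDEN, TGT_VOCAB)

        def forward(self, src, tgt_in):
            # src, tgt_in: (seq_len, batch) tensors of token indices
            _, state = self.encoder(self.src_emb(src))      # fixed-size summary of the source
            out, _   = self.decoder(self.tgt_emb(tgt_in), state)
            return self.proj(out)                           # (tgt_len, batch, TGT_VOCAB) logits

    model = Seq2Seq()

    # Uniform initialization of every parameter in [-0.08, 0.08].
    for p in model.parameters():
        nn.init.uniform_(p, -0.08, 0.08)

    # Plain SGD without momentum at a fixed learning rate of 0.7 for the first 5 epochs,
    # then halved every half epoch; after 7.5 epochs that is 0.7 / 2**5 = 0.021875.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.7)

    def lr_at(epoch: float) -> float:
        halvings = 0 if epoch <= 5.0 else int((epoch - 5.0) / 0.5)
        return 0.7 / (2 ** halvings)

    # Batches of 128 sequences; averaging the loss over the batch plays the same role
    # as the paper's division of the summed gradient by the batch size.
    criterion = nn.CrossEntropyLoss(reduction="mean")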
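
The BLEU figures quoted in the Research Type row (34.8 for the LSTM versus 33.3 for the phrase-based baseline) are corpus-level scores. The snippet below shows one way such a score can be computed; sacrebleu is my choice of tool, not one named in the paper, and the sentences are placeholders.

    import sacrebleu

    hypotheses = ["the cat sits on the mat"]     # system translations, one string per test sentence
    references = [["the cat sat on the mat"]]    # one reference stream, parallel to the hypotheses

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.1f}")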