Sequence to Sequence Learning with Neural Networks
Authors: Ilya Sutskever, Oriol Vinyals, Quoc V. Le
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. |
| Researcher Affiliation | Industry | Ilya Sutskever, Google, ilyasu@google.com; Oriol Vinyals, Google, vinyals@google.com; Quoc V. Le, Google, qvl@google.com |
| Pseudocode | No | The paper describes the model and processes via text and equations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of their source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | We used the WMT'14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean selected subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT [29]. |
| Dataset Splits | No | The paper mentions 'training and test set' but does not specify a separate validation set with explicit sizes, percentages, or methodology for its split. |
| Hardware Specification | No | The paper mentions running experiments 'on a single GPU' and using 'an 8-GPU machine', but does not provide specific details such as GPU model numbers, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions a 'C++ implementation' but does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or compiler versions). |
| Experiment Setup | Yes | We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000-dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. We initialized all of the LSTM's parameters with the uniform distribution between -0.08 and 0.08. We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs. We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128). |
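
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the quoted configuration. The authors' implementation was in C++ and was not released, so the module layout, optimizer wiring, and learning-rate helper below are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB = 160_000, 80_000   # input / output vocabulary sizes
EMB_DIM, HIDDEN, LAYERS = 1000, 1000, 4  # 1000-d embeddings, 4 LSTM layers of 1000 cells


class Seq2SeqLSTM(nn.Module):
    """Encoder-decoder LSTM with the layer sizes reported in the paper."""

    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB_DIM)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB_DIM)
        self.encoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS)
        self.decoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)
        # "We initialized all of the LSTM's parameters with the uniform
        # distribution between -0.08 and 0.08."
        for p in self.parameters():
            nn.init.uniform_(p, -0.08, 0.08)

    def forward(self, src, tgt):
        # src, tgt: (seq_len, batch) tensors of token ids.
        _, state = self.encoder(self.src_emb(src))            # final (h, c) of the encoder
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)   # decoder conditioned on it
        return self.out(dec_out)                              # per-step vocabulary logits


model = Seq2SeqLSTM()
# SGD without momentum at a fixed learning rate of 0.7.
optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.0)
# A mean-reduced loss over a batch of 128 sequences corresponds to dividing
# the summed gradient by the batch size, as described in the paper.
criterion = nn.CrossEntropyLoss(reduction="mean")


def lr_for_epoch(epoch: float, base_lr: float = 0.7) -> float:
    """After 5 epochs, halve the learning rate every half epoch (7.5 epochs total)."""
    return base_lr if epoch <= 5.0 else base_lr * 0.5 ** ((epoch - 5.0) / 0.5)
```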
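
The headline figures in the Research Type row are corpus-level BLEU scores (34.8 for the LSTM versus 33.3 for the phrase-based SMT baseline). The paper does not state which BLEU implementation it used; the sketch below only illustrates how a corpus BLEU score can be computed with the sacrebleu library, using invented example sentences.

```python
# Example sentences are invented; this is an illustration of the metric,
# not a reproduction of the paper's evaluation pipeline.
import sacrebleu

hypotheses = ["the cat sits on the mat"]    # one system translation per test sentence
references = [["the cat sat on the mat"]]   # one inner list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"corpus BLEU = {bleu.score:.1f}")    # the paper reports 34.8 (LSTM) vs. 33.3 (SMT)
```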