Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. |
| Researcher Affiliation | Collaboration | Ashish Vaswani Google Brain avaswani@google.com Noam Shazeer Google Brain noam@google.com Niki Parmar Google Research nikip@google.com Jakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.com Aidan N. Gomez University of Toronto aidan@cs.toronto.edu Łukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhin illia.polosukhin@gmail.com |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor. |
| Open Datasets | Yes | We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. [...] For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. |
| Dataset Splits | Yes | We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. [...] On the WMT 2014 English-to-German translation task... [...] measuring the change in performance on English-to-German translation on the development set, newstest2013. |
| Hardware Specification | Yes | We trained our models on one machine with 8 NVIDIA P100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and providing code on TensorFlow, but does not specify version numbers for any software components or libraries. |
| Experiment Setup | Yes | We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. [...] We used the Adam optimizer [17] with β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹. We varied the learning rate over the course of training, according to the formula: lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5)) [...] We used warmup_steps = 4000. [...] We apply dropout [27] to the output of each sub-layer... For the base model, we use a rate of P_drop = 0.1. [...] we employed label smoothing of value ε_ls = 0.1 [30]. (A sketch of this learning-rate schedule follows the table.) |
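
Below is a minimal sketch of the learning-rate schedule quoted in the Experiment Setup row, assuming d_model = 512 and warmup_steps = 4000 as in the base configuration. The function name and plain-Python form are illustrative, not taken from the tensor2tensor code.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from the paper:
    lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5)).

    The rate rises linearly for the first `warmup_steps` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step_num = max(step_num, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# The peak rate occurs exactly at step_num == warmup_steps.
print(transformer_lrate(4_000))    # ~7.0e-4
print(transformer_lrate(100_000))  # ~1.4e-4 (end of base-model training)
```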