Mesh-TensorFlow: Deep Learning for Supercomputers

Authors: Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer [16] sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. ... To examine the benefit of scaling the Transformer model in the manner suggested by the previous section, we trained such models on machine translation and language modeling tasks. Results are given in Tables 2 and 3.
Researcher Affiliation | Industry | Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman; Google Brain; {noam, ylc, nikip, trandustin, avaswani, penporn, phawkins, hyouklee, hongm, cliffy, rsepassi, blakehechtman}@google.com
Pseudocode | Yes | Algorithm 1: Synchronous data-parallelism with replicated parameters. Each processor maintains a complete copy of all weights W^(t). The batch b^(t) of training examples for timestep t is partitioned among the set P of processors: b^(t) = ⋃_{p ∈ P} b^(t)_p. Below is the computation performed on one processor p ∈ P. (A minimal sketch of this update pattern appears below the table.)
Open Source Code | Yes | Mesh-TensorFlow is available at https://github.com/tensorflow/mesh. ... The Mesh-TensorFlow library is available at https://github.com/tensorflow/mesh and is under active development. (A rough usage sketch appears below the table.)
Open Datasets | Yes | To examine the benefit of scaling the Transformer model in the manner suggested by the previous section, we trained such models on machine translation and language modeling tasks. Results are given in Tables 2 and 3. For the billion-word language modeling benchmark, we trained the models for 10 epochs. ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs.
Dataset Splits | Yes | For the billion-word language modeling benchmark, we trained the models for 10 epochs. ... Per-word dev-perplexity for the largest model was 24.0, but dropped to 23.5 when the model was evaluated with the logits multiplied by 0.9 (likely due to overfitting). ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs. ... Quality improved with model size, with the largest model achieving a BLEU score of 43.9 (evaluated using sacrebleu), the best published result to date. (The logit-scaling trick is sketched below the table.)
Hardware Specification | Yes | Using TPU meshes of up to 512 cores ... on 2-dimensional TPUv2 meshes of up to 16×32 = 512 cores, maintaining computational efficiency of over 50% (6 PFLOP/s out of a maximum 11.5 PFLOP/s) on the largest models. ... The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster. ... The largest model (2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster. (The quoted utilization arithmetic is worked through below the table.)
Software Dependencies | No | The paper mentions 'TensorFlow' and 'Python library' but does not specify exact version numbers for these or other software dependencies crucial for replication.
Experiment Setup | Yes | For the billion-word language modeling benchmark, we trained the models for 10 epochs. The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster. Batch size for all models was 256 sequences of 256 tokens each (each sequence was the concatenation of multiple training sentences). ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs. ... Additional details about the configurations for these experiments are available as part of the tensor2tensor library on GitHub.
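
The Pseudocode row quotes Algorithm 1, the paper's synchronous data-parallel update with replicated parameters. Below is a minimal single-machine sketch of that pattern in plain NumPy; the `grad(W, examples)` helper is hypothetical and stands in for whatever loss gradient the model computes, so this is illustrative only, not the paper's implementation.

```python
import numpy as np

def synchronous_data_parallel_step(W, batch, num_processors, grad, lr=0.1):
    """One timestep of the Algorithm 1 pattern, simulated sequentially.

    W              : weights, conceptually replicated on every processor
    batch          : the full batch b^(t) for this timestep
    num_processors : |P|, the number of processors
    grad           : hypothetical helper grad(W, examples) -> dLoss/dW
    """
    # Partition the batch among processors: b^(t) = union over p in P of b^(t)_p.
    shards = np.array_split(batch, num_processors)

    # Each processor p computes a gradient on its own shard b^(t)_p
    # (here done in a Python loop instead of in parallel).
    local_grads = [grad(W, shard) for shard in shards]

    # Allreduce: every processor ends up with the same summed gradient.
    total_grad = np.sum(local_grads, axis=0)

    # Applying the identical update everywhere keeps the replicas in sync.
    return W - lr * total_grad
```

In an actual Mesh-TensorFlow run the shards live on different TPU cores and the sum is performed by an Allreduce collective rather than a Python loop.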
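The Open Source Code row points to the mesh repository. The fragment below is a rough sketch, adapted from the style of example the library's documentation gives, of how a computation is written in terms of named dimensions; the calls (`mtf.Dimension`, `mtf.import_tf_tensor`, `mtf.get_variable`, `mtf.einsum`) come from the public library, but exact signatures may differ across versions, and the sizes chosen here are arbitrary.

```python
import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension gets a name; a separately chosen layout decides
# which named dimensions are split across the processor mesh.
batch_dim = mtf.Dimension("batch", 256)
io_dim = mtf.Dimension("io", 1024)
hidden_dim = mtf.Dimension("hidden", 4096)

# Bring an ordinary TF tensor onto the mesh, naming its axes.
x_tf = tf.random.normal([256, 1024])
x = mtf.import_tf_tensor(mesh, x_tf, shape=[batch_dim, io_dim])

# A two-layer network expressed purely with named dimensions.
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim],
                     initializer=tf.random_normal_initializer())
v = mtf.get_variable(mesh, "v", shape=[hidden_dim, io_dim],
                     initializer=tf.random_normal_initializer())
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])
loss = mtf.reduce_mean(mtf.square(y - x))
```

Splitting the "batch" dimension across the mesh gives data parallelism; additionally splitting "hidden" (and the matching weight dimensions) gives the model parallelism used for the multi-billion-parameter Transformer.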
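The Dataset Splits row notes that dev perplexity on the billion-word benchmark dropped from 24.0 to 23.5 when the logits were multiplied by 0.9 at evaluation time. The snippet below is a hypothetical illustration of that trick (effectively a softmax temperature above 1 that softens an overfit model's predictions); the paper gives no code for it.

```python
import numpy as np

def log_probs_with_scaled_logits(logits, scale=0.9):
    """Log-probabilities after multiplying the logits by `scale`.

    logits : array [..., vocab] of unnormalized scores
    scale  : 0.9 reproduces the evaluation trick quoted above
    """
    z = scale * np.asarray(logits, dtype=np.float64)
    z -= z.max(axis=-1, keepdims=True)                 # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def per_word_perplexity(logits, targets, scale=0.9):
    """exp of the mean negative log-probability of the target tokens."""
    lp = log_probs_with_scaled_logits(logits, scale)
    token_lp = np.take_along_axis(lp, targets[..., None], axis=-1)[..., 0]
    return float(np.exp(-token_lp.mean()))
```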
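The Hardware Specification and Experiment Setup rows quote a few figures that are easy to sanity-check; the arithmetic below simply reproduces them and adds no new data.

```python
# 2-dimensional TPUv2 mesh: 16 x 32 = 512 cores.
cores = 16 * 32                  # 512

# Utilization on the largest models: 6 PFLOP/s out of an 11.5 PFLOP/s peak.
utilization = 6.0 / 11.5         # ~0.52, i.e. "over 50%"

# Batch of 256 sequences x 256 tokens = 65,536 tokens per training step.
tokens_per_batch = 256 * 256     # 65,536

print(cores, f"{utilization:.0%}", tokens_per_batch)
```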