Mesh-TensorFlow: Deep Learning for Supercomputers

Authors: Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer [16] sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. ... To examine the benefit of scaling the Transformer model in the manner suggested by the previous section, we trained such models on machine translation and language modeling tasks. Results are given in Tables 2 and 3.
Researcher Affiliation | Industry | Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman; Google Brain; {noam, ylc, nikip, trandustin, avaswani, penporn, phawkins, hyouklee, hongm, cliffy, rsepassi, blakehechtman}@google.com
Pseudocode | Yes | Algorithm 1: Synchronous data-parallelism with replicated parameters. Each processor maintains a complete copy of all weights W^(t). The batch b^(t) of training examples for timestep t is partitioned among the set P of processors: b^(t) = ⋃_{p ∈ P} b^(t)_p. Below is the computation performed on one processor p ∈ P. (A minimal sketch of this update pattern appears below the table.)
Open Source Code | Yes | Mesh-TensorFlow is available at https://github.com/tensorflow/mesh. ... The Mesh-TensorFlow library is available at https://github.com/tensorflow/mesh and is under active development. (A rough usage sketch appears below the table.)
Open Datasets | Yes | To examine the benefit of scaling the Transformer model in the manner suggested by the previous section, we trained such models on machine translation and language modeling tasks. Results are given in Tables 2 and 3. For the billion-word language modeling benchmark, we trained the models for 10 epochs. ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs.
Dataset Splits | Yes | For the billion-word language modeling benchmark, we trained the models for 10 epochs. ... Per-word dev-perplexity for the largest model was 24.0, but dropped to 23.5 when the model was evaluated with the logits multiplied by 0.9 (likely due to overfitting). ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs. ... Quality improved with model size, with the largest model achieving a BLEU score of 43.9 (evaluated using sacrebleu), the best published result to date. (The logit-scaling trick is sketched below the table.)
Hardware Specification | Yes | Using TPU meshes of up to 512 cores ... on 2-dimensional TPUv2 meshes of up to 16×32 = 512 cores, maintaining computational efficiency of over 50% (6 PFLOP/s out of a maximum 11.5 PFLOP/s) on the largest models. ... The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster. ... The largest model (2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster. (The quoted utilization arithmetic is worked through below the table.)
Software Dependencies | No | The paper mentions 'TensorFlow' and 'Python library' but does not specify exact version numbers for these or other software dependencies crucial for replication.
Experiment Setup | Yes | For the billion-word language modeling benchmark, we trained the models for 10 epochs. The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster. Batch size for all models was 256 sequences of 256 tokens each (each sequence was the concatenation of multiple training sentences). ... On the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs. ... Additional details about the configurations for these experiments are available as part of the tensor2tensor library on GitHub.
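
The Pseudocode row quotes Algorithm 1, the paper's synchronous data-parallel update with replicated parameters. Below is a minimal single-machine sketch of that pattern in plain NumPy; the `grad(W, examples)` helper is hypothetical and stands in for whatever loss gradient the model computes, so this is illustrative only, not the paper's implementation.

```python
import numpy as np

def synchronous_data_parallel_step(W, batch, num_processors, grad, lr=0.1):
    """One timestep of the Algorithm 1 pattern, simulated sequentially.

    W              : weights, conceptually replicated on every processor
    batch          : the full batch b^(t) for this timestep
    num_processors : |P|, the number of processors
    grad           : hypothetical helper grad(W, examples) -> dLoss/dW
    """
    # Partition the batch among processors: b^(t) = union over p in P of b^(t)_p.
    shards = np.array_split(batch, num_processors)

    # Each processor p computes a gradient on its own shard b^(t)_p
    # (here done in a Python loop instead of in parallel).
    local_grads = [grad(W, shard) for shard in shards]

    # Allreduce: every processor ends up with the same summed gradient.
    total_grad = np.sum(local_grads, axis=0)

    # Applying the identical update everywhere keeps the replicas in sync.
    return W - lr * total_grad
```

In an actual Mesh-TensorFlow run the shards live on different TPU cores and the sum is performed by an Allreduce collective rather than a Python loop.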
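The Open Source Code row points to the mesh repository. The fragment below is a rough sketch, adapted from the style of example the library's documentation gives, of how a computation is written in terms of named dimensions; the calls (`mtf.Dimension`, `mtf.import_tf_tensor`, `mtf.get_variable`, `mtf.einsum`) come from the public library, but exact signatures may differ across versions, and the sizes chosen here are arbitrary.

```python
import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension gets a name; a separately chosen layout decides
# which named dimensions are split across the processor mesh.
batch_dim = mtf.Dimension("batch", 256)
io_dim = mtf.Dimension("io", 1024)
hidden_dim = mtf.Dimension("hidden", 4096)

# Bring an ordinary TF tensor onto the mesh, naming its axes.
x_tf = tf.random.normal([256, 1024])
x = mtf.import_tf_tensor(mesh, x_tf, shape=[batch_dim, io_dim])

# A two-layer network expressed purely with named dimensions.
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim],
                     initializer=tf.random_normal_initializer())
v = mtf.get_variable(mesh, "v", shape=[hidden_dim, io_dim],
                     initializer=tf.random_normal_initializer())
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])
loss = mtf.reduce_mean(mtf.square(y - x))
```

Splitting the "batch" dimension across the mesh gives data parallelism; additionally splitting "hidden" (and the matching weight dimensions) gives the model parallelism used for the multi-billion-parameter Transformer.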
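The Dataset Splits row notes that dev perplexity on the billion-word benchmark dropped from 24.0 to 23.5 when the logits were multiplied by 0.9 at evaluation time. The snippet below is a hypothetical illustration of that trick (effectively a softmax temperature above 1 that softens an overfit model's predictions); the paper gives no code for it.

```python
import numpy as np

def log_probs_with_scaled_logits(logits, scale=0.9):
    """Log-probabilities after multiplying the logits by `scale`.

    logits : array [..., vocab] of unnormalized scores
    scale  : 0.9 reproduces the evaluation trick quoted above
    """
    z = scale * np.asarray(logits, dtype=np.float64)
    z -= z.max(axis=-1, keepdims=True)                 # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def per_word_perplexity(logits, targets, scale=0.9):
    """exp of the mean negative log-probability of the target tokens."""
    lp = log_probs_with_scaled_logits(logits, scale)
    token_lp = np.take_along_axis(lp, targets[..., None], axis=-1)[..., 0]
    return float(np.exp(-token_lp.mean()))
```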
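The Hardware Specification and Experiment Setup rows quote a few figures that are easy to sanity-check; the arithmetic below simply reproduces them and adds no new data.

```python
# 2-dimensional TPUv2 mesh: 16 x 32 = 512 cores.
cores = 16 * 32                  # 512

# Utilization on the largest models: 6 PFLOP/s out of an 11.5 PFLOP/s peak.
utilization = 6.0 / 11.5         # ~0.52, i.e. "over 50%"

# Batch of 256 sequences x 256 tokens = 65,536 tokens per training step.
tokens_per_batch = 256 * 256     # 65,536

print(cores, f"{utilization:.0%}", tokens_per_batch)
```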