GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Authors: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we strive to improve model quality while training efficiently. We built a 600-billion-parameter sequence-to-sequence Transformer model with Sparsely-Gated Mixture-of-Experts layers... We trained this model with 2048 TPU v3 devices for 4 days... We conducted experiments with various model sizes and found that translation quality increases as the model gets bigger. |
| Researcher Affiliation | Industry | Dmitry Lepikhin lepikhin@google.com Hyouk Joong Lee hyouklee@google.com Yuanzhong Xu yuanzx@google.com Dehao Chen dehao@google.com Orhan Firat orhanf@google.com Yanping Huang huangyp@google.com Maxim Krikun krikun@google.com Noam Shazeer noam@google.com Zhifeng Chen zhifengc@google.com |
| Pseudocode | Yes | Algorithm 1: Group-level top-2 gating with auxiliary loss and Algorithm 2: Forward pass of the position-wise MoE layer. (A hedged sketch of the top-2 gating logic appears after this table.) |
| Open Source Code | Yes | We open sourced our example implementation and provided a step-by-step instruction on how to train it on the public cloud provider (https://github.com/tensorflow/lingvo/tree/master/lingvo/tasks/lm). |
| Open Datasets | No | The paper states 'This resulted in approximately 13 billion training examples to be used for model training' and refers to multilingual MT, but it does not provide a specific link, DOI, or explicit statement of public availability for the training data used. |
| Dataset Splits | No | The paper mentions 'approximately 13 billion training examples' and evaluation 'on a held-out test set', but it does not explicitly provide specific percentages, sample counts, or clear predefined splits for training, validation, and test sets. |
| Hardware Specification | Yes | We trained this model with 2048 TPU v3 devices for 4 days |
| Software Dependencies | No | The paper mentions software like TensorFlow and XLA, stating 'We implemented the partitioner in the XLA compiler xla (2019).', but it does not provide specific version numbers for these or other key software components used in the experiments. |
| Experiment Setup | Yes | During training, we use float32 for both model weights and activations in order to ensure training stability. We also ran additional scalability experiments with MoE(2048E, 60L) with bfloat16 activations with more than one trillion model weights. We used the Adafactor (Shazeer & Stern, 2018) optimizer with a) factored second-moment estimation; b) first moment decay β1 = 0.0; c) second moment decay β2 = 0.99 with a 1 − t^(−0.8) schedule; d) clipping threshold of 1.0; and e) 1.0 learning rate with square root decay after 10k training steps. (A sketch of these schedules appears after this table.) |
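
For readers checking the pseudocode row, below is a minimal NumPy sketch of group-level top-2 gating with an auxiliary load-balancing loss, in the spirit of the paper's Algorithm 1. It is a simplified reading under stated assumptions, not the released implementation: the function and variable names (`top2_gating`, `tokens`, `wg`, `capacity`) are illustrative, and the paper's random dispatch of the second expert is omitted.

```python
import numpy as np

def top2_gating(tokens, wg, capacity):
    """tokens: [S, M] token embeddings in one group; wg: [M, E] gating weights."""
    S = tokens.shape[0]
    E = wg.shape[1]
    logits = tokens @ wg                              # [S, E]
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)             # softmax over experts

    top1 = gates.argmax(-1)                           # primary expert per token
    masked = gates.copy()
    masked[np.arange(S), top1] = -np.inf
    top2 = masked.argmax(-1)                          # secondary expert per token

    # Auxiliary balancing loss: mean over experts of
    # (fraction of tokens whose top-1 expert is e) * (mean gate value for e).
    mean_gate = gates.mean(axis=0)                    # m_e
    frac_routed = np.bincount(top1, minlength=E) / S  # c_e / S
    l_aux = np.mean(frac_routed * mean_gate)

    # Dispatch with a per-expert capacity; overflowing tokens are dropped.
    dispatch = np.zeros((S, E), dtype=bool)
    load = np.zeros(E, dtype=int)
    for s in range(S):
        for e in (top1[s], top2[s]):
            if load[e] < capacity:
                dispatch[s, e] = True
                load[e] += 1
    return gates, dispatch, l_aux

# Toy usage: 8 tokens, model dim 4, 4 experts, capacity 2S/E = 4.
rng = np.random.default_rng(0)
gates, dispatch, l_aux = top2_gating(rng.normal(size=(8, 4)),
                                     rng.normal(size=(4, 4)), capacity=4)
```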
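
The optimizer settings quoted in the experiment-setup row translate into two simple schedules. The sketch below is an assumed reading of that description, not code from the paper: the exact form of the post-10k square-root decay and the 0.99 cap on the second-moment decay are inferred from the quoted text and common Adafactor conventions.

```python
def learning_rate(step, base_lr=1.0, decay_start=10_000):
    """Constant base_lr, then inverse-square-root decay after `decay_start` steps."""
    if step <= decay_start:
        return base_lr
    return base_lr * (decay_start / step) ** 0.5

def second_moment_decay(step, cap=0.99, exponent=0.8):
    """beta2_t = min(cap, 1 - t^(-exponent)), i.e. the 1 - t^(-0.8) schedule."""
    return min(cap, 1.0 - float(step) ** (-exponent))

# Example: rates at a few training steps.
for t in (1, 100, 10_000, 100_000):
    print(t, learning_rate(t), round(second_moment_decay(t), 4))
```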