GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Authors: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we strive to improve model quality while training efficiently. We built a 600-billion-parameter sequence-to-sequence Transformer model with Sparsely-Gated Mixture-of-Experts layers... We trained this model with 2048 TPU v3 devices for 4 days... We conducted experiments with various model sizes and found that translation quality increases as the model gets bigger. |
| Researcher Affiliation | Industry | Dmitry Lepikhin lepikhin@google.com Hyouk Joong Lee hyouklee@google.com Yuanzhong Xu yuanzx@google.com Dehao Chen dehao@google.com Orhan Firat orhanf@google.com Yanping Huang huangyp@google.com Maxim Krikun krikun@google.com Noam Shazeer noam@google.com Zhifeng Chen zhifengc@google.com |
| Pseudocode | Yes | Algorithm 1: Group-level top-2 gating with auxiliary loss and Algorithm 2: Forward pass of the position-wise MoE layer. (A hedged sketch of the top-2 gating logic appears after this table.) |
| Open Source Code | Yes | We open sourced our example implementation and provided a step-by-step instruction on how to train it on the public cloud provider (https://github.com/tensorflow/lingvo/tree/master/lingvo/tasks/lm). |
| Open Datasets | No | The paper states 'This resulted in approximately 13 billion training examples to be used for model training' and refers to multilingual MT, but it does not provide a specific link, DOI, or explicit statement of public availability for the training data used. |
| Dataset Splits | No | The paper mentions 'approximately 13 billion training examples' and evaluation 'on a held-out test set', but it does not explicitly provide specific percentages, sample counts, or clear predefined splits for training, validation, and test sets. |
| Hardware Specification | Yes | We trained this model with 2048 TPU v3 devices for 4 days |
| Software Dependencies | No | The paper mentions software like TensorFlow and XLA, stating 'We implemented the partitioner in the XLA compiler xla (2019).', but it does not provide specific version numbers for these or other key software components used in the experiments. |
| Experiment Setup | Yes | During training, we use float32 for both model weights and activations in order to ensure training stability. We also ran additional scalability experiments with MoE(2048E, 60L) with bfloat16 activations with more than one trillion model weights. We used the Adafactor (Shazeer & Stern, 2018) optimizer with a) factored second-moment estimation; b) first moment decay β1 = 0.0; c) second moment decay β2 = 0.99 with a 1 − t^(−0.8) schedule; d) clipping threshold of 1.0; and e) 1.0 learning rate with square root decay after 10k training steps. (A sketch of these schedules appears after this table.) |
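
For readers checking the pseudocode row, below is a minimal NumPy sketch of group-level top-2 gating with an auxiliary load-balancing loss, in the spirit of the paper's Algorithm 1. It is a simplified reading under stated assumptions, not the released implementation: the function and variable names (`top2_gating`, `tokens`, `wg`, `capacity`) are illustrative, and the paper's random dispatch of the second expert is omitted.

```python
import numpy as np

def top2_gating(tokens, wg, capacity):
    """tokens: [S, M] token embeddings in one group; wg: [M, E] gating weights."""
    S = tokens.shape[0]
    E = wg.shape[1]
    logits = tokens @ wg                              # [S, E]
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)             # softmax over experts

    top1 = gates.argmax(-1)                           # primary expert per token
    masked = gates.copy()
    masked[np.arange(S), top1] = -np.inf
    top2 = masked.argmax(-1)                          # secondary expert per token

    # Auxiliary balancing loss: mean over experts of
    # (fraction of tokens whose top-1 expert is e) * (mean gate value for e).
    mean_gate = gates.mean(axis=0)                    # m_e
    frac_routed = np.bincount(top1, minlength=E) / S  # c_e / S
    l_aux = np.mean(frac_routed * mean_gate)

    # Dispatch with a per-expert capacity; overflowing tokens are dropped.
    dispatch = np.zeros((S, E), dtype=bool)
    load = np.zeros(E, dtype=int)
    for s in range(S):
        for e in (top1[s], top2[s]):
            if load[e] < capacity:
                dispatch[s, e] = True
                load[e] += 1
    return gates, dispatch, l_aux

# Toy usage: 8 tokens, model dim 4, 4 experts, capacity 2S/E = 4.
rng = np.random.default_rng(0)
gates, dispatch, l_aux = top2_gating(rng.normal(size=(8, 4)),
                                     rng.normal(size=(4, 4)), capacity=4)
```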
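
The optimizer settings quoted in the experiment-setup row translate into two simple schedules. The sketch below is an assumed reading of that description, not code from the paper: the exact form of the post-10k square-root decay and the 0.99 cap on the second-moment decay are inferred from the quoted text and common Adafactor conventions.

```python
def learning_rate(step, base_lr=1.0, decay_start=10_000):
    """Constant base_lr, then inverse-square-root decay after `decay_start` steps."""
    if step <= decay_start:
        return base_lr
    return base_lr * (decay_start / step) ** 0.5

def second_moment_decay(step, cap=0.99, exponent=0.8):
    """beta2_t = min(cap, 1 - t^(-exponent)), i.e. the 1 - t^(-0.8) schedule."""
    return min(cap, 1.0 - float(step) ** (-exponent))

# Example: rates at a few training steps.
for t in (1, 100, 10_000, 100_000):
    print(t, learning_rate(t), round(second_moment_decay(t), 4))
```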