TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Authors: Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, Chao Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x improvements over the popular DeepSpeed-MoE, FastMoE and FasterMoE systems. We conduct experiments on various typical network topologies and model configurations. |
| Researcher Affiliation | Collaboration | Chang Chen (1), Min Li (2,3), Zhihua Wu (4), Dianhai Yu (4), Chao Yang (2,3,5). (1) Center for Data Science, Peking University; (2) School of Mathematical Sciences, Peking University; (3) National Engineering Laboratory for Big Data Analysis and Applications, Peking University; (4) Baidu Inc.; (5) Institute for Computing and Digital Economy, Peking University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of TA-MoE is available at: https://github.com/Chen-Chang/TA-MoE |
| Open Datasets | Yes | ...train on the open-source openwebtext2 dataset [1]. Openwebtext2. https://openwebtext2.readthedocs.io/en/latest/, 2022. |
| Dataset Splits | No | The paper uses a validation set for evaluation (e.g., 'validation loss w.r.t. steps') but does not specify the dataset split percentages or exact methodology for creating these splits from the overall dataset. |
| Hardware Specification | Yes | For cluster A, each node consists of 8 NVIDIA Tesla 40GB A100 GPUs connected with NVSwitch... Clusters B and C are equipped with 8 NVIDIA Tesla 32GB V100 GPUs in each node. Additionally, Table 2 (columns: Clusters, GPU, Intra-Node, Inter-Node, Symmetric, Same switch) lists the GPU type per cluster: A: 40G-A100, B: 32G-V100, C: 32G-V100. |
| Software Dependencies | Yes | Besides, the software configurations are CUDA 11.0 with NCCL 2.8.4 for cluster A, and CUDA 11.1 with NCCL 2.8.3 for clusters B and C. |
| Experiment Setup | Yes | The number of experts is chosen among {8, 16, 32, 48, 64}, with one expert deployed per device. Both the Switch top-1 [7] and the GShard top-2 gates [11] are tested with the weight of the auxiliary loss set to 1.0. For consistency of the experiments, the models are implemented in a single framework, Paddle [2], and trained on the open-source openwebtext2 dataset [1]. More detailed specifications of the model settings can be found in Table 3 (Detailed specifications of the GPT models; columns: Gate, Layers, Hidden size, Intermediate size, Batch size, Data type, Capacity factor, Clusters). A hedged sketch of the gating configuration follows this table. |
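The Experiment Setup row quotes the gating configuration (Switch top-1 gate, auxiliary loss weight 1.0) without showing what that computation looks like. The following NumPy sketch is an illustration only: it is not the paper's Paddle implementation, and the names `switch_top1_gate` and `w_gate` are hypothetical. It implements the standard Switch-style top-1 router with the usual load-balancing auxiliary loss, under the assumption that TA-MoE uses the conventional formulation from the cited gate papers.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def switch_top1_gate(x, w_gate, aux_loss_weight=1.0):
    """Switch-style top-1 gating with the standard auxiliary
    load-balancing loss. Hypothetical sketch, not the paper's code.

    x:      [tokens, hidden]  token representations
    w_gate: [hidden, experts] router weight matrix (assumed name)

    Returns the chosen expert per token, its gate value, and the
    auxiliary loss added to the training objective with weight 1.0,
    matching the setting quoted in the table above.
    """
    probs = softmax(x @ w_gate)          # [tokens, experts]
    expert = probs.argmax(axis=-1)       # top-1 expert per token
    gate_value = probs.max(axis=-1)      # scales the expert's output

    num_experts = w_gate.shape[1]
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert, minlength=num_experts) / len(expert)
    # p_i: mean router probability assigned to expert i
    p = probs.mean(axis=0)
    # Balanced dispatch is encouraged by num_experts * sum(f_i * p_i)
    aux_loss = aux_loss_weight * num_experts * np.sum(f * p)
    return expert, gate_value, aux_loss

# Tiny usage example with random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # 16 tokens, hidden size 8
w = rng.standard_normal((8, 4))    # 4 experts
expert_ids, gates, aux = switch_top1_gate(x, w)
print(expert_ids, round(float(aux), 3))
```

The GShard top-2 gate mentioned in the same row additionally routes each token to its second-best expert, subject to per-expert capacity limits (the "Capacity factor" column of Table 3); the auxiliary loss term has the same balancing form.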