TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

Authors: Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, Chao Yang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x improvements over the popular DeepSpeed-MoE, FastMoE and FasterMoE systems. We conduct experiments on various typical network topologies and model configurations.
Researcher Affiliation | Collaboration | Chang Chen (1), Min Li (2,3), Zhihua Wu (4), Dianhai Yu (4), Chao Yang (2,3,5); (1) Center for Data Science, Peking University; (2) School of Mathematical Sciences, Peking University; (3) National Engineering Laboratory for Big Data Analysis and Applications, Peking University; (4) Baidu Inc.; (5) Institute for Computing and Digital Economy, Peking University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code of TA-MoE is available at: https://github.com/Chen-Chang/TA-MoE
Open Datasets | Yes | ...train on the open-source openwebtext2 dataset [1]. Openwebtext2. https://openwebtext2.readthedocs.io/en/latest/, 2022.
Dataset Splits | No | The paper uses a validation set for evaluation (e.g., 'validation loss w.r.t. steps') but does not specify the dataset split percentages or the exact methodology for creating these splits from the overall dataset.
Hardware Specification | Yes | For cluster A, each node consists of 8 NVIDIA Tesla 40GB A100 GPUs connected with NVSwitch... Clusters B and C are equipped with 8 NVIDIA Tesla 32GB V100 GPUs in each node. Additionally, Table 2 lists per-cluster details (columns: Clusters, GPU, Intra-Node, Inter-Node, Symmetric, Same switch), specifying 'A: 40G-A100', 'B: 32G-V100', 'C: 32G-V100'.
Software Dependencies | Yes | The software configurations are CUDA 11.0 with NCCL 2.8.4 for cluster A, and CUDA 11.1 with NCCL 2.8.3 for clusters B and C.
Experiment Setup | Yes | The number of experts is chosen among {8, 16, 32, 48, 64}, with each device deployed with one expert. Both the Switch top-1 [7] and the GShard top-2 gates [11] are tested, with the weight of the auxiliary loss set to 1.0. For consistency of the experiments, we implement the models in a single framework, Paddle [2], and train on the open-source openwebtext2 dataset [1]. More detailed model specifications can be found in Table 3 ('Detailed specifications of the GPT models'), which lists the gate, layers, hidden size, intermediate size, batch size, data type, capacity factor, and clusters for each configuration. An illustrative gating sketch follows this table.
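The gating configuration quoted above (a GShard-style top-2 gate with an auxiliary load-balancing loss weighted at 1.0) can be illustrated with a minimal NumPy sketch. This is not the authors' Paddle implementation; the function name top2_gate, the tensor shapes, and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

def top2_gate(logits, aux_loss_weight=1.0):
    """Toy GShard-style top-2 gate (illustrative sketch, not the TA-MoE code).

    logits: [num_tokens, num_experts] router outputs.
    Returns per-token expert indices, combine weights, and an auxiliary
    load-balancing loss weighted as in the paper's setup (weight 1.0).
    """
    num_tokens, num_experts = logits.shape

    # Softmax over the expert dimension to obtain routing probabilities.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Pick the two highest-probability experts for each token.
    top2_idx = np.argsort(-probs, axis=-1)[:, :2]             # [num_tokens, 2]
    top2_probs = np.take_along_axis(probs, top2_idx, axis=-1)
    combine_weights = top2_probs / top2_probs.sum(axis=-1, keepdims=True)

    # Auxiliary load-balancing loss: for each expert, the fraction of tokens
    # routed to it (top-1 choice) times its mean gate probability, summed and
    # scaled by the number of experts.
    top1_idx = top2_idx[:, 0]
    tokens_per_expert = np.bincount(top1_idx, minlength=num_experts) / num_tokens
    mean_prob_per_expert = probs.mean(axis=0)
    aux_loss = aux_loss_weight * num_experts * np.dot(tokens_per_expert,
                                                      mean_prob_per_expert)
    return top2_idx, combine_weights, aux_loss

# Example: 8 tokens routed over 8 experts (one expert per device, as in the paper).
rng = np.random.default_rng(0)
idx, weights, loss = top2_gate(rng.normal(size=(8, 8)))
print(idx.shape, weights.shape, float(loss))
```

In the paper's setup each device hosts exactly one expert, so the num_experts dimension in this sketch corresponds to the number of devices participating in the all-to-all dispatch.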