LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Authors: Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the effectiveness of our methods, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks by a 1.28×–2.2× speedup.
Researcher Affiliation | Collaboration | Xiaonan Nie¹, Qibin Liu¹, Fangcheng Fu¹, Shenhan Zhu¹, Xupeng Miao², Xiaoyang Li³, Yang Zhang³, Shouda Liu³, Bin Cui¹ (¹Peking University, ²Purdue University, ³ByteDance)
Pseudocode | Yes | Meanwhile, we also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1. (An illustrative workflow sketch is given after the table.)
Open Source Code | Yes | Additionally, we provide our code in the supplementary material submitted along with the paper.
Open Datasets | Yes | The RoBERTa-MoE model is pre-trained with masked language modeling tasks on a combined dataset, which includes BooksCorpus (800M words) and English Wikipedia (2,500M words). To be specific, we fine-tune two open-sourced models, including the language model GPT-MoE on the General Language Understanding Evaluation (GLUE) benchmark and the vision model Swin-MoE on the ImageNet classification benchmark.
Dataset Splits | Yes | We meticulously tracked the time required to achieve equivalent model performance levels (perplexity) during training, as depicted in Figure 6. The perplexity curves are smoothed with a 1D Gaussian filter (σ = 0.5). (A smoothing example is given after the table.)
Hardware Specification | Yes | V100 Cluster: the first hardware environment includes two servers, each outfitted with eight NVIDIA V100 (32GB) GPUs. A100 Cluster: the second hardware environment consists of four servers, each equipped with eight NVIDIA A100 (40GB) GPUs.
Software Dependencies | Yes | We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Our experiments were conducted using a Docker image built upon the official NVIDIA GPU containers, which includes Ubuntu 20.04, CUDA 11.3, cuDNN 8.2.0, and NCCL 2.12.7. (A runtime version-check snippet is given after the table.)
Experiment Setup | Yes | Since LSH requires selecting several hyperparameters, such as the type and number of hash functions, we opt for the cross-polytope hash function based on empirical evaluation and set the number of hash functions to 6. Model configurations are detailed in Table 1. (A cross-polytope hashing sketch is given after the table.)
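
The Algorithm 1 referenced in the pseudocode row lives in Appendix A.1 of the paper and is not reproduced here. As a rough, hedged illustration of the workflow that row points to, the single-process sketch below clusters an expert's incoming tokens by their LSH codes, applies the expert only to each cluster's centroid, and scatters the centroid outputs back to the tokens; in the actual system only the centroids would travel through the all-to-all exchange, and the paper's residual-based compensation step is omitted. The helper names (`sign_hash_codes`, `lsh_moe_expert_forward`) are hypothetical, and a plain random-hyperplane (sign) hash stands in for the cross-polytope hash (sketched separately below).

```python
import torch

def sign_hash_codes(x, num_hashes=6, seed=0):
    """Stand-in LSH: signs of random projections (the paper uses cross-polytope hashing)."""
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(x.size(-1), num_hashes, generator=g)
    return (x @ planes > 0).long()                          # (num_tokens, num_hashes)

def lsh_moe_expert_forward(tokens, expert, num_hashes=6):
    """Hypothetical sketch: cluster tokens by LSH code, run the expert on the
    cluster centroids only, then broadcast each centroid's output back to its tokens."""
    codes = sign_hash_codes(tokens, num_hashes)
    # Tokens with identical codes share one cluster.
    _, cluster_id = torch.unique(codes, dim=0, return_inverse=True)
    num_clusters = int(cluster_id.max()) + 1
    d = tokens.size(-1)
    # Centroid = mean of the tokens in each cluster.
    sums = torch.zeros(num_clusters, d).index_add_(0, cluster_id, tokens)
    counts = torch.bincount(cluster_id, minlength=num_clusters).clamp(min=1)
    centroids = sums / counts.unsqueeze(-1)
    # In the real system only the centroids cross the all-to-all exchange;
    # here the expert is simply applied locally.
    centroid_out = expert(centroids)
    return centroid_out[cluster_id]                         # scatter back token-wise

# Toy usage: 1024 tokens of dimension 64, a linear layer standing in for an expert.
tokens = torch.randn(1024, 64)
expert = torch.nn.Linear(64, 64)
out = lsh_moe_expert_forward(tokens, expert)
print(out.shape)    # torch.Size([1024, 64])
```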
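The experiment-setup row states that a cross-polytope hash with 6 hash functions is used. Under the usual definition of this hash family, each hash rotates the input and records the signed index of the largest-magnitude coordinate, i.e. the nearest vertex of the cross-polytope {±e_i}; concatenating 6 such codes yields the bucket key. Practical implementations typically use fast pseudo-random rotations (e.g., Hadamard-based); the dense Gaussian projection below is only for clarity and is not claimed to match the paper's code.

```python
import torch

def cross_polytope_codes(x, num_hashes=6, seed=0):
    """Return one integer code per hash function for each row of x."""
    g = torch.Generator().manual_seed(seed)
    n, d = x.shape
    codes = []
    for _ in range(num_hashes):
        R = torch.randn(d, d, generator=g)                  # stand-in for a pseudo-random rotation
        y = x @ R
        idx = y.abs().argmax(dim=-1)                        # which axis is closest
        sign = torch.gather(y, 1, idx.unsqueeze(1)).squeeze(1) >= 0
        codes.append(idx * 2 + sign.long())                 # fold the sign into the code
    return torch.stack(codes, dim=-1)                       # shape (n, num_hashes)

# Tokens that share all 6 codes land in the same bucket/cluster.
tokens = torch.randn(8, 16)
buckets = cross_polytope_codes(tokens)
print(buckets.shape)   # torch.Size([8, 6])
```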
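The dataset-splits row notes that the perplexity curves in Figure 6 are smoothed with a 1D Gaussian filter (σ = 0.5). One way to reproduce that kind of smoothing, assuming the raw perplexities are available as a NumPy array (the paper does not specify its plotting code, so this is only a plausible sketch):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical raw perplexity values logged every 10 steps.
rng = np.random.default_rng(0)
steps = np.arange(0, 1000, 10)
ppl = 20.0 * np.exp(-steps / 400.0) + 5.0 + rng.normal(0.0, 0.3, steps.size)

# 1D Gaussian smoothing with sigma = 0.5, as described in the paper.
ppl_smoothed = gaussian_filter1d(ppl, sigma=0.5)
print(ppl_smoothed[:5])
```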
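The software stack listed in the dependencies row can be sanity-checked from inside the container with standard PyTorch APIs; the values in the comments are the versions reported in the paper, and the snippet itself is only a convenience, not part of the authors' code.

```python
import torch

print("PyTorch:", torch.__version__)                    # paper reports 1.11
print("CUDA  :", torch.version.cuda)                    # paper reports 11.3
print("cuDNN :", torch.backends.cudnn.version())        # paper reports 8.2.0
if torch.distributed.is_available() and torch.distributed.is_nccl_available():
    print("NCCL  :", torch.cuda.nccl.version())         # paper reports 2.12.7
```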