Toward Efficient Inference for Mixture of Experts

Authors: Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Shruti Bhosale, Hsien-Hsin Lee, Carole-Jean Wu, Benjamin Lee

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that dynamic gating improves maximum throughput by 6.21-11.55× for LM, 5.75-10.98× for MT Encoder, and 2.58-5.71× for MT Decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT. We implement and evaluate these optimizations for language modeling (LM) and machine translation (MT) tasks.
Researcher Affiliation | Collaboration | Haiyang Huang (Duke University), Newsha Ardalani (FAIR at Meta), Anna Sun (FAIR at Meta), Liu Ke (Washington University in St. Louis), Hsien-Hsin S. Lee (Intel Corporation), Shruti Bhosale (FAIR at Meta), Carole-Jean Wu (FAIR at Meta), Benjamin Lee (University of Pennsylvania)
Pseudocode | No | The paper describes its methods in narrative text and diagrams (e.g., Figure 2) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/hyhuang00/moe_inference.
Open Datasets | Yes | Language Modeling (LM): We use three domains (Wikipedia, PubMed, GitHub) from the PILE dataset [22] as input, following [8]. Machine Translation (MT): We evaluate expert activation by performing translation from English to French, Japanese, and Austrian using validation data from NLLB-200 [9].
Dataset Splits | No | For Language Modeling (LM), the paper uses three domains (Wikipedia, PubMed, GitHub) from the PILE dataset [22] as input without specifying explicit training, validation, or test splits. For Machine Translation (MT), it uses validation data from NLLB-200 [9], which implies a pre-defined split, but the paper does not detail its own data partitioning beyond that.
Hardware Specification | Yes | Table 2 details our experimental clusters. We use Apple to characterize MoE workloads (Table 1) and study the impact of our proposed optimizations. ... Cluster Apple: 2× Intel Xeon E5-2698 v4 CPUs with 700 GB memory; 8× NVIDIA Tesla V100 GPUs with 32 GB memory each. Cluster Pear: 2× Intel Xeon Gold 5317 CPUs with 64 GB memory; 4× NVIDIA RTX A5000 GPUs with 24 GB memory each.
Software Dependencies | No | The paper mentions software such as Fairseq, Tutel, FasterMoE, MegaBlocks, and the PyTorch Profiler, but does not provide specific version numbers for these or for other underlying dependencies (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup | Yes | The mini-batch size is set to 8 for Language Modeling and 48 for Machine Translation, the largest feasible values under the baseline. Table 1 details the model parameters: the number of experts (E), how often an FFN is replaced by an MoE layer (M), and each expert's capacity fraction (C); a sketch of how these parameters relate to gating follows the table.
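
A minimal Python sketch of the gating tradeoff behind these parameters, assuming C denotes the fraction of a mini-batch's tokens that each expert reserves under static gating; the function names, the example routing, and that interpretation of C are illustrative assumptions, not code from the paper's repository.

```python
# Hypothetical sketch (not the paper's implementation): comparing static,
# capacity-based expert buffers with dynamically sized buffers.

def static_expert_capacity(tokens_per_batch: int, capacity_fraction: float) -> int:
    """Static gating: every expert reserves a fixed buffer of C * batch tokens."""
    return int(tokens_per_batch * capacity_fraction)

def dynamic_expert_sizes(expert_assignments: list[int], num_experts: int) -> list[int]:
    """Dynamic gating: size each expert's buffer from the actual routing counts."""
    counts = [0] * num_experts
    for expert_id in expert_assignments:
        counts[expert_id] += 1
    return counts

if __name__ == "__main__":
    E, C = 4, 0.25                           # number of experts, capacity fraction
    assignments = [0, 0, 1, 3, 3, 3, 2, 0]   # assumed router output for 8 tokens
    static = [static_expert_capacity(len(assignments), C)] * E   # [2, 2, 2, 2]
    dynamic = dynamic_expert_sizes(assignments, E)               # [3, 1, 1, 3]
    print("static buffers:", static, "dynamic buffers:", dynamic)
```

In this toy example the static buffers both overflow (experts 0 and 3) and sit partly unused (experts 1 and 2), whereas the dynamically sized buffers hold exactly the routed tokens; avoiding that padding and overflow is the kind of effect behind the memory and throughput gains reported in the Research Type row.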