Toward Efficient Inference for Mixture of Experts
Authors: Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Shruti Bhosale, Hsien-Hsin Lee, Carole-Jean Wu, Benjamin Lee
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that dynamic gating improves maximum throughput by 6.21-11.55× for LM, 5.75-10.98× for MT Encoder, and 2.58-5.71× for MT Decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT. We implement and evaluate these optimizations for language modeling (LM) and machine translation (MT) tasks. |
| Researcher Affiliation | Collaboration | Haiyang Huang (1), Newsha Ardalani (2), Anna Sun (2), Liu Ke (3), Hsien-Hsin S. Lee (4), Shruti Bhosale (2), Carole-Jean Wu (2), Benjamin Lee (5); 1: Duke University, 2: FAIR at Meta, 3: Washington University in St. Louis, 4: Intel Corporation, 5: University of Pennsylvania |
| Pseudocode | No | The paper describes methods in narrative text and uses diagrams (e.g., Figure 2) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/hyhuang00/moe_inference. |
| Open Datasets | Yes | Language Modeling (LM). We use three domains (Wikipedia, PubMed, GitHub) from the PILE dataset [22] as input, following [8]. Machine Translation (MT). We evaluate expert activation by performing translation from English to French, Japanese, and Asturian using validation data from NLLB-200 [9]. |
| Dataset Splits | No | For Language Modeling (LM), the paper uses 'three domains Wikipedia, PubMed, GitHub, from the PILE dataset [22] as input' without specifying explicit training, validation, or test splits. For Machine Translation (MT), it mentions using 'validation data from NLLB-200 [9]', which implies a pre-defined split, but does not detail the specific data partitioning for their own experiments beyond that. |
| Hardware Specification | Yes | Table 2 details our experimental clusters. We use Apple to characterize MoE workloads (Table 1) and study the impact of our proposed optimizations. ... Cluster Apple: 2× Intel Xeon E5-2698 v4 CPUs with 700GB memory, 8× NVIDIA Tesla V100 GPUs with 32GB memory each. Cluster Pear: 2× Intel Xeon Gold 5317 CPUs with 64GB memory, 4× NVIDIA RTX A5000 GPUs with 24GB memory each. |
| Software Dependencies | No | The paper mentions software like Fairseq, Tutel, FasterMoE, MegaBlocks, and PyTorch Profiler, but does not provide specific version numbers for these or other underlying software dependencies (e.g., Python, PyTorch, CUDA) required for reproducibility. |
| Experiment Setup | Yes | The mini-batch size is set to 8 for Language Modeling and 48 for Machine Translation, the largest feasible values under the baseline. Table 1 details model parameters. The table specifies the number of experts (E), how often an FFN is replaced by an MoE (M), and each expert's capacity fraction (C). |
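
The throughput and memory gains reported in the Research Type row come from the paper's dynamic gating, which replaces fixed, capacity-bounded expert buffers with buffers sized to the tokens actually routed to each expert. The snippet below is a minimal sketch of that distinction, not the authors' implementation; the function name `expert_buffer_sizes`, the skewed routing distribution, and all numeric values are assumptions chosen only to make the example runnable.

```python
# Minimal sketch (not the paper's code): static, capacity-bounded expert
# buffers vs. dynamically sized buffers under top-1 routing.
import torch


def expert_buffer_sizes(assignments, num_experts, capacity_fraction=None):
    """Return per-expert buffer sizes for one routed mini-batch.

    assignments: 1-D tensor of expert ids, one per token (top-1 routing).
    capacity_fraction: if set, emulate static gating with a fixed capacity
        of capacity_fraction * num_tokens / num_experts per expert;
        if None, emulate dynamic gating (buffers match the actual load).
    """
    tokens_per_expert = torch.bincount(assignments, minlength=num_experts)
    if capacity_fraction is None:
        return tokens_per_expert  # dynamic gating: no padding, no drops
    capacity = int(capacity_fraction * assignments.numel() / num_experts)
    return torch.full_like(tokens_per_expert, capacity)  # static gating


if __name__ == "__main__":
    torch.manual_seed(0)
    num_tokens, num_experts = 4096, 16
    # Skewed routing: a few experts receive most of the tokens (illustrative).
    probs = torch.distributions.Dirichlet(torch.full((num_experts,), 0.3)).sample()
    assignments = torch.multinomial(probs, num_tokens, replacement=True)

    static = expert_buffer_sizes(assignments, num_experts, capacity_fraction=1.0)
    dynamic = expert_buffer_sizes(assignments, num_experts, capacity_fraction=None)
    print("static buffer total  :", int(static.sum()))   # padded/dropped to capacity
    print("dynamic buffer total :", int(dynamic.sum()))  # exactly the routed tokens
    print("tokens exceeding static capacity:",
          int((dynamic - static).clamp(min=0).sum()))
```

With skewed routing, the static scheme pads under-loaded experts and overflows hot ones, while the dynamic scheme allocates exactly the routed token count, which is the intuition behind the reported throughput and memory improvements.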
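
The Experiment Setup row names the mini-batch sizes (8 for LM, 48 for MT) and the Table 1 parameters E, M, and C. The sketch below shows how such parameters are commonly combined into a per-expert token capacity under static gating; only the mini-batch sizes come from the paper excerpt, and every other value (expert counts, MoE frequency, capacity fraction, sequence lengths) is a placeholder, not an entry from the paper's Table 1.

```python
# Hedged sketch of the setup parameters named in the Experiment Setup row.
# Placeholder values throughout, except the mini-batch sizes (8 for LM, 48 for MT).
from dataclasses import dataclass


@dataclass
class MoESetup:
    num_experts: int          # E: experts per MoE layer
    moe_frequency: int        # M: every M-th FFN is replaced by an MoE layer
    capacity_fraction: float  # C: fraction of batch tokens each expert may hold
    mini_batch_size: int
    tokens_per_sample: int    # sequence length; placeholder assumption

    def expert_capacity(self) -> int:
        """Tokens one expert can hold per mini-batch under static gating:
        floor(C * total_tokens / E)."""
        total_tokens = self.mini_batch_size * self.tokens_per_sample
        return int(self.capacity_fraction * total_tokens / self.num_experts)


# Placeholder configurations; only mini_batch_size reflects the paper.
lm_setup = MoESetup(num_experts=16, moe_frequency=2, capacity_fraction=1.0,
                    mini_batch_size=8, tokens_per_sample=1024)
mt_setup = MoESetup(num_experts=16, moe_frequency=4, capacity_fraction=1.0,
                    mini_batch_size=48, tokens_per_sample=128)
print("LM expert capacity:", lm_setup.expert_capacity())
print("MT expert capacity:", mt_setup.expert_capacity())
```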