Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Authors: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang "Atlas" Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency up to 6.1%. Our system achieves a 6.1% reduction in mean latency and a 10% improvement in tail latency compared to state-of-the-art systems.
Researcher Affiliation | Collaboration | The University of Texas at Austin, Qualcomm AI Research
Pseudocode | Yes | Algorithm 1: Read-ME Expert-aware Batching Algorithm (pseudocode; an illustrative sketch of the idea follows the table).
Open Source Code | Yes | Codes are available at: https://github.com/VITA-Group/READ-ME.
Open Datasets | Yes | Model and Dataset: We perform the MoE refactorization based on the Llama2-7B-chat [19] model, a popular open-source model pre-trained on 2 trillion tokens. The training corpus [35] involves data collected from 7 different sources: Arxiv [20], Books [21], Common Crawl, C4 [36], Github, Wikipedia [37], and Stack Exchange [22]. (A hedged data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using subsets of data (e.g., the RedPajama dataset [35]) and monitoring validation loss during tuning, but it does not provide specific train/validation/test splits (e.g., percentages or exact counts) for its own experimental setup.
Hardware Specification | Yes | We use 8 A100 GPUs with 80GB of memory for all tuning experiments. Our setup employs a single A100 GPU with 80GB of memory.
Software Dependencies | No | The paper mentions software like the DeepSpeed inference engine [38] and standard architectures like the Transformer, but does not list specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 2: Hyper-parameter choices during training, reproduced below (a learning-rate schedule sketch follows the table).
  Hyper-parameter             | Router Tuning | Expert Tuning
  # Iterations per Round      | 100           | 200
  # Rounds                    | 8             | 8
  Initial LR at Round 0       | 5e-4          | 5e-5
  LR Decay within Round       | Cosine        | Cosine
  LR Decay Type across Rounds | Exponential   | Exponential
  LR Decay Rate across Rounds | 0.8           | 0.8
  Weight Decay                | 0.01          | 0.01
  Batch Size                  | 64            | 128
  Sequence Length             | 4096          | 4096
  # Tokens per Round          | 26 M          | 105 M
  # Tokens in Total           | 1.04 B
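
The Pseudocode row refers to Algorithm 1, the Read-ME expert-aware batching algorithm. The paper's pseudocode is not reproduced here; the snippet below is only a minimal sketch of the underlying idea, assuming the decoupled (pre-gating) router has already produced each request's expert choice so the scheduler can group requests by expert before decoding. The `Request` fields, the greedy grouping policy, and `max_batch_size` are assumptions of this sketch, not details of the paper's implementation.

```python
# Illustrative only: greedily group requests by their pre-computed expert choice
# so that an expert's weights are loaded once per batch. Names and policy are
# assumptions for this sketch, not taken from the paper's Algorithm 1.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    expert_id: int  # produced ahead of decoding by the decoupled (pre-gating) router


def expert_aware_batches(pending, max_batch_size):
    """Group pending requests by expert id, then split each group into batches."""
    by_expert = defaultdict(list)
    for req in pending:  # preserve arrival order within each expert's queue
        by_expert[req.expert_id].append(req)

    batches = []
    for expert_id, queue in by_expert.items():
        for start in range(0, len(queue), max_batch_size):
            batches.append((expert_id, queue[start:start + max_batch_size]))
    return batches


# Six requests routed to two experts form four batches of size <= 2.
pending = [Request(req_id=i, expert_id=i % 2) for i in range(6)]
for expert_id, batch in expert_aware_batches(pending, max_batch_size=2):
    print(expert_id, [r.req_id for r in batch])
```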
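
The Open Datasets row lists the seven RedPajama data sources used for tuning. As a rough, hedged illustration of assembling such a mixture with Hugging Face `datasets`, the sketch below assumes the Hub copy `togethercomputer/RedPajama-Data-1T`, its subset names, and uniform sampling weights; none of these specifics (nor the library itself) are prescribed by the paper.

```python
# A sketch of streaming a RedPajama-style 7-source mixture. The Hub dataset name,
# subset names, and uniform weights below are assumptions, not taken from the paper.
from datasets import interleave_datasets, load_dataset

SUBSETS = ["arxiv", "book", "c4", "common_crawl", "github", "stackexchange", "wikipedia"]

streams = [
    load_dataset("togethercomputer/RedPajama-Data-1T", name, split="train", streaming=True)
    for name in SUBSETS
]
# Uniform interleaving as a placeholder; the paper does not state mixture weights.
mixture = interleave_datasets(streams, probabilities=[1 / len(streams)] * len(streams), seed=0)

for example in mixture.take(3):  # peek at a few documents from the blended stream
    print(example["text"][:80])
```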
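
The Experiment Setup row's Table 2 specifies cosine decay within each round and exponential decay (rate 0.8) across rounds, starting from 5e-4 for router tuning and 5e-5 for expert tuning. One way to realize that schedule is sketched below, assuming the cosine anneals to zero by the last iteration of a round (the floor is not stated in the paper).

```python
# Learning-rate schedule implied by Table 2: exponential decay across rounds
# (rate 0.8) combined with cosine decay within each round. The zero cosine floor
# is an assumption; the paper does not state a minimum learning rate.
import math


def lr_at(step_in_round, round_idx, initial_lr, iters_per_round, round_decay=0.8):
    round_base = initial_lr * (round_decay ** round_idx)  # exponential across rounds
    cosine = 0.5 * (1.0 + math.cos(math.pi * step_in_round / iters_per_round))
    return round_base * cosine  # cosine decay within the round


# Router tuning: 8 rounds of 100 iterations each, starting at 5e-4.
print(lr_at(0, 0, initial_lr=5e-4, iters_per_round=100))   # 5.0e-4 at the very start
print(lr_at(50, 3, initial_lr=5e-4, iters_per_round=100))  # 1.28e-4 halfway through round 3
# Expert tuning would use initial_lr=5e-5 and iters_per_round=200.
```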