$\textit{Read-ME}$: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
Authors: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang "Atlas" Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency up to 6.1%. Our system achieves a 6.1% reduction in mean latency and a 10% improvement in tail latency compared to state-of-the-art systems. |
| Researcher Affiliation | Collaboration | 1The University of Texas at Austin, 2Qualcomm AI Research |
| Pseudocode | Yes | Algorithm 1: Read-ME Expert-aware Batching Algorithm (pseudocode); an illustrative sketch of the batching idea appears below the table |
| Open Source Code | Yes | Codes are available at: https://github.com/VITA-Group/READ-ME. |
| Open Datasets | Yes | Model and Dataset: We perform the MoE refactorization based on Llama2-7B-chat [19] model, a popular open-source model pre-trained on 2 trillion tokens. The training corpus [35] involves the data collected from 7 different resources: Arxiv [20], Books [21], Common Crawl, C4 [36], Github, Wikipedia [37], and Stack Exchange [22]. |
| Dataset Splits | No | The paper mentions using subsets of data (e.g., the RedPajama dataset [35]) and monitoring validation loss during tuning, but it does not provide specific train/validation/test dataset splits (e.g., percentages or exact counts) for its own experimental setup. |
| Hardware Specification | Yes | We use 8 A100 GPUs with 80GB of memory for all tuning experiments. Our setup employs a single A100 GPU with 80GB of memory. |
| Software Dependencies | No | The paper mentions software like the DeepSpeed inference engine [38] and standard architectures like the Transformer, but does not list specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 2 (hyper-parameter choices during training; values given as Router Tuning / Expert Tuning): # Iterations per Round 100 / 200; # Rounds 8 / 8; Initial LR at Round 0 5e-4 / 5e-5; LR Decay within Round Cosine / Cosine; LR Decay type across Rounds Exponential / Exponential; LR Decay rate across Rounds 0.8 / 0.8; Weight Decay 0.01 / 0.01; Batch Size 64 / 128; Sequence Length 4096 / 4096; # Tokens per Round 26M / 105M; # Tokens in Total 1.04B. The learning-rate schedule is sketched below the table. |
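
The Experiment Setup row describes a two-level learning-rate schedule: cosine decay within each round and exponential decay across rounds with rate 0.8. Below is a minimal sketch of how such a schedule could be computed; the function name `round_lr` and the decay-to-zero floor within a round are assumptions for illustration, not details taken from the paper.

```python
import math

def round_lr(initial_lr: float, round_idx: int, iter_in_round: int,
             iters_per_round: int, round_decay: float = 0.8) -> float:
    """Learning rate with cosine decay inside a round and exponential decay across rounds.

    Follows the schedule reported in Table 2: the LR at the start of round r is
    initial_lr * round_decay**r, then follows a cosine curve over the round's
    iterations (decaying to 0 within the round is an assumption).
    """
    round_start_lr = initial_lr * (round_decay ** round_idx)
    progress = iter_in_round / max(iters_per_round, 1)
    return 0.5 * round_start_lr * (1.0 + math.cos(math.pi * progress))

# Example with the reported router-tuning settings: 8 rounds of 100 iterations,
# initial LR 5e-4, decay rate 0.8 across rounds.
if __name__ == "__main__":
    for r in range(8):
        print(f"round {r}: start LR = {round_lr(5e-4, r, 0, 100):.2e}")
```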
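The Pseudocode row refers to the paper's Algorithm 1 (expert-aware batching), which is not reproduced here. The sketch below is an assumed illustration of the underlying idea, given that Read-ME's decoupled pre-gating router assigns each request an expert ahead of decoding: requests routed to the same expert are grouped so that each expert's weights are loaded once per batch. The names `Request`, `expert_aware_batches`, and the `max_batch_size` parameter are hypothetical, not from the paper's algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Request:
    request_id: int
    expert_id: int          # produced ahead of time by the decoupled pre-gating router
    tokens: List[int]

def expert_aware_batches(pending: List[Request],
                         max_batch_size: int = 32) -> List[List[Request]]:
    """Group pending requests by expert, then split each group into size-bounded batches."""
    by_expert: Dict[int, List[Request]] = defaultdict(list)
    for req in pending:
        by_expert[req.expert_id].append(req)

    batches: List[List[Request]] = []
    for reqs in by_expert.values():
        # Chunk each expert's queue so no batch exceeds the size limit.
        for start in range(0, len(reqs), max_batch_size):
            batches.append(reqs[start:start + max_batch_size])
    return batches

# Example: three requests routed to two experts yield two batches, one per expert.
if __name__ == "__main__":
    queue = [Request(0, expert_id=1, tokens=[101]),
             Request(1, expert_id=0, tokens=[102]),
             Request(2, expert_id=1, tokens=[103])]
    for batch in expert_aware_batches(queue):
        print([r.request_id for r in batch])
```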