Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Authors: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang "Atlas" Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency up to 6.1%. Our system achieves a 6.1% reduction in mean latency and a 10% improvement in tail latency compared to state-of-the-art systems.
Researcher Affiliation | Collaboration | The University of Texas at Austin, Qualcomm AI Research
Pseudocode | Yes | Algorithm 1: Read-ME Expert-aware Batching Algorithm (pseudocode; an illustrative sketch of the idea follows the table).
Open Source Code | Yes | Codes are available at: https://github.com/VITA-Group/READ-ME.
Open Datasets | Yes | Model and Dataset: We perform the MoE refactorization based on the Llama2-7B-chat [19] model, a popular open-source model pre-trained on 2 trillion tokens. The training corpus [35] involves data collected from 7 different sources: Arxiv [20], Books [21], Common Crawl, C4 [36], Github, Wikipedia [37], and Stack Exchange [22]. (A hedged data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using subsets of data (e.g., the RedPajama dataset [35]) and monitoring validation loss during tuning, but it does not provide specific train/validation/test splits (e.g., percentages or exact counts) for its own experimental setup.
Hardware Specification | Yes | We use 8 A100 GPUs with 80GB of memory for all tuning experiments. Our setup employs a single A100 GPU with 80GB of memory.
Software Dependencies | No | The paper mentions software like the DeepSpeed inference engine [38] and standard architectures like the Transformer, but does not list specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 2: Hyper-parameter choices during training, reproduced below (a learning-rate schedule sketch follows the table).
  Hyper-parameter             | Router Tuning | Expert Tuning
  # Iterations per Round      | 100           | 200
  # Rounds                    | 8             | 8
  Initial LR at Round 0       | 5e-4          | 5e-5
  LR Decay within Round       | Cosine        | Cosine
  LR Decay Type across Rounds | Exponential   | Exponential
  LR Decay Rate across Rounds | 0.8           | 0.8
  Weight Decay                | 0.01          | 0.01
  Batch Size                  | 64            | 128
  Sequence Length             | 4096          | 4096
  # Tokens per Round          | 26 M          | 105 M
  # Tokens in Total           | 1.04 B
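
The Pseudocode row refers to Algorithm 1, the Read-ME expert-aware batching algorithm. The paper's pseudocode is not reproduced here; the snippet below is only a minimal sketch of the underlying idea, assuming the decoupled (pre-gating) router has already produced each request's expert choice so the scheduler can group requests by expert before decoding. The `Request` fields, the greedy grouping policy, and `max_batch_size` are assumptions of this sketch, not details of the paper's implementation.

```python
# Illustrative only: greedily group requests by their pre-computed expert choice
# so that an expert's weights are loaded once per batch. Names and policy are
# assumptions for this sketch, not taken from the paper's Algorithm 1.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    expert_id: int  # produced ahead of decoding by the decoupled (pre-gating) router


def expert_aware_batches(pending, max_batch_size):
    """Group pending requests by expert id, then split each group into batches."""
    by_expert = defaultdict(list)
    for req in pending:  # preserve arrival order within each expert's queue
        by_expert[req.expert_id].append(req)

    batches = []
    for expert_id, queue in by_expert.items():
        for start in range(0, len(queue), max_batch_size):
            batches.append((expert_id, queue[start:start + max_batch_size]))
    return batches


# Six requests routed to two experts form four batches of size <= 2.
pending = [Request(req_id=i, expert_id=i % 2) for i in range(6)]
for expert_id, batch in expert_aware_batches(pending, max_batch_size=2):
    print(expert_id, [r.req_id for r in batch])
```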
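
The Open Datasets row lists the seven RedPajama data sources used for tuning. As a rough, hedged illustration of assembling such a mixture with Hugging Face `datasets`, the sketch below assumes the Hub copy `togethercomputer/RedPajama-Data-1T`, its subset names, and uniform sampling weights; none of these specifics (nor the library itself) are prescribed by the paper.

```python
# A sketch of streaming a RedPajama-style 7-source mixture. The Hub dataset name,
# subset names, and uniform weights below are assumptions, not taken from the paper.
from datasets import interleave_datasets, load_dataset

SUBSETS = ["arxiv", "book", "c4", "common_crawl", "github", "stackexchange", "wikipedia"]

streams = [
    load_dataset("togethercomputer/RedPajama-Data-1T", name, split="train", streaming=True)
    for name in SUBSETS
]
# Uniform interleaving as a placeholder; the paper does not state mixture weights.
mixture = interleave_datasets(streams, probabilities=[1 / len(streams)] * len(streams), seed=0)

for example in mixture.take(3):  # peek at a few documents from the blended stream
    print(example["text"][:80])
```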
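
The Experiment Setup row's Table 2 specifies cosine decay within each round and exponential decay (rate 0.8) across rounds, starting from 5e-4 for router tuning and 5e-5 for expert tuning. One way to realize that schedule is sketched below, assuming the cosine anneals to zero by the last iteration of a round (the floor is not stated in the paper).

```python
# Learning-rate schedule implied by Table 2: exponential decay across rounds
# (rate 0.8) combined with cosine decay within each round. The zero cosine floor
# is an assumption; the paper does not state a minimum learning rate.
import math


def lr_at(step_in_round, round_idx, initial_lr, iters_per_round, round_decay=0.8):
    round_base = initial_lr * (round_decay ** round_idx)  # exponential across rounds
    cosine = 0.5 * (1.0 + math.cos(math.pi * step_in_round / iters_per_round))
    return round_base * cosine  # cosine decay within the round


# Router tuning: 8 rounds of 100 iterations each, starting at 5e-4.
print(lr_at(0, 0, initial_lr=5e-4, iters_per_round=100))   # 5.0e-4 at the very start
print(lr_at(50, 3, initial_lr=5e-4, iters_per_round=100))  # 1.28e-4 halfway through round 3
# Expert tuning would use initial_lr=5e-5 and iters_per_round=200.
```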