Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
Authors: Yunqi Gao, Bing Hu, Boloursaz Mashhadi, A-Long Jin, Yanfeng Zhang, Pei Xiao, Rahim Tafazolli, Merouane DEBBAH
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with 675 typical MoE layers and four real-world MoE models across two GPU clusters demonstrate that our proposed FlowMoE framework outperforms state-of-the-art MoE training frameworks, reducing training time by 13%-57%, energy consumption by 10%-39%, and memory usage by 7%-32%. |
| Researcher Affiliation | Academia | (1) School of Information Science and Electronic Engineering, Zhejiang University; (2) 5GIC & 6GIC, Institute for Communication Systems (ICS), University of Surrey; (3) School of Advanced Technology, Xi'an Jiaotong-Liverpool University; (4) School of Computer Science and Engineering, Northeastern University; (5) KU 6G Research Center, Department of Computer and Information Engineering, Khalifa University. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Training process in FlowMoE. Algorithm 2: Communication pool management. |
| Open Source Code | Yes | FlowMoE's code is available at https://github.com/ZJU-CNLAB/FlowMoE. |
| Open Datasets | Yes | GPT2-Tiny-MoE and DeepSeek-V2-S for the language modeling task on the OpenWebText dataset [39], and BERT-Large-MoE and LLaMA2-MoE for the text generation task on the wikitext-103 dataset [40]. |
| Dataset Splits | No | The paper mentions splitting the dataset for pipelining purposes (Algorithm 1, Line 4: Split D for d1, d2, ..., dR into Data Queue;), but it does not specify explicit training, validation, or test splits. It refers to using standard datasets like OpenWebText and wikitext-103 without detailing how these were partitioned for different experimental phases. |
| Hardware Specification | Yes | We use two clusters. (1) Cluster 1 consists of 2 nodes connected with 100Gb/s bandwidth. Each node is equipped with 8 NVIDIA RTX3090 GPUs (24 GB of memory per GPU) connected with PCIe3.0x16. The CPU is Intel Xeon(R) Gold 6248R. (2) Cluster 2 consists of 4 nodes connected with 10Gb/s bandwidth. Each node is equipped with 2 NVIDIA RTX2080Ti GPUs (12 GB of memory per GPU) connected with PCIe switches. The CPU is Intel Xeon(R) Gold 5118. |
| Software Dependencies | No | We implement FlowMoE as an adaptive and generic framework atop PyTorch. We implement FlowMoE in PyTorch with its API... In particular, we implement FlowMoE atop Tutel [12], a highly optimized MoE acceleration library that is deeply integrated into PyTorch and supports asynchronous execution of communication and computing tasks. Tutel has also been used as a default MoE training module by DeepSpeed [38]. |
| Experiment Setup | Yes | We compare FlowMoE with PyTorch-based vanilla expert parallelism (vanilla EP) [19] and existing state-of-the-art MoE frameworks, including ScheMoE [10], FSMoE [24], Tutel [12] and FasterMoE [11], with a pipelining degree of R = 2 unless specified otherwise. We use the official implementations of these frameworks. We focus on per-iteration training time, energy consumption and memory usage as performance metrics. All the reported numbers are averaged over 1000 iterations. Table 2: Benchmark models. Columns: MoE Model; # Params (MHA+Gating); # Params (Experts); Dataset; Configurations (L, B, N, M, H, E/P, k, f) ... (includes specific values for L, B, N, M, H, E/P, k, f for different models) We set the number of experts equal to the number of GPUs (i.e., E = P) and k = 2. Sp becomes the knob that determines the trade-off between optimal scheduling and system overhead. ... We adopt BO to automatically tune Sp during training... In our experiments, BO samples 8 values of Sp and records the iteration time corresponding to each value by averaging 10 iterations. |
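
The tuning procedure quoted in the Experiment Setup row (8 sampled values of Sp, each scored by the mean of 10 iteration times) can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain random sampling for Bayesian optimization, and `synthetic_step`, the candidate range, and all function names are hypothetical stand-ins for a real training iteration.

```python
import random
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


def mean_iteration_time(step: Callable[[float], float], sp: float, iters: int = 10) -> float:
    """Average the per-iteration time that `step` reports for a fixed Sp."""
    return mean(step(sp) for _ in range(iters))


def tune_sp(step: Callable[[float], float],
            candidates: Sequence[float],
            iters: int = 10) -> Tuple[float, Dict[float, float]]:
    """Score every candidate Sp and return the fastest one with all timings.

    Assumption: random-search stand-in, not Bayesian optimization.
    """
    timings = {sp: mean_iteration_time(step, sp, iters) for sp in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def synthetic_step(sp: float) -> float:
    """Hypothetical cost model: a small Sp adds scheduling overhead,
    a large Sp reduces overlap. Replace with a timed training iteration."""
    return 1.0 / sp + 0.05 * sp


rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = sorted(rng.uniform(1.0, 16.0) for _ in range(8))  # 8 sampled values
best_sp, timings = tune_sp(synthetic_step, candidates, iters=10)
print(f"best Sp = {best_sp:.3f}, mean iteration time = {timings[best_sp]:.4f}")
```

In a real run, `step` would time one training iteration at the given Sp; a Bayesian optimizer would additionally use earlier measurements to choose the next candidate rather than sampling them all up front.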