Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FlashMoE: Fast Distributed MoE in a Single Kernel
Authors: Osayamen Aimuyo, Byungsoo Oh, Rachee Singh
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on an 8-H100 GPU node with MoE models having up to 128 experts and 16K token sequences, FlashMoE achieves up to 9× higher GPU utilization, 6× lower latency, 5.7× higher throughput, and 4× better overlap efficiency compared to state-of-the-art baselines despite using FP32 while baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We implement (§G) and evaluate FlashMoE across five metrics: Forward Latency (§4.1), GPU Utilization (§4.2), Overlap Efficiency (§4.4), Throughput (§4.3), and Expert Scalability (§4.5). |
| Researcher Affiliation | Academia | Osayamen Jonathan Aimuyo Cornell University EMAIL Byungsoo Oh Cornell University EMAIL Rachee Singh Cornell University EMAIL |
| Pseudocode | Yes | Algorithm 1: FlashMoE Distributed MoE Fused Kernel; Algorithm 2: Processor Actor: executed by a block; Algorithm 3: Scheduler Actor: executed by one warp; Algorithm 4: Subscriber Actor: executed by three warps |
| Open Source Code | Yes | We provide code at https://github.com/osayamenja/FlashMoE. |
| Open Datasets | No | The paper discusses evaluating MoE models and comparing performance against baselines like Comet, FasterMoE, and Megatron-LM. It mentions 'FlashMoE performance' and 'GPT-3 MoE model' in figures and supplementary material, but does not provide concrete access information (link, DOI, specific repository, or formal citation with authors/year) for any dataset used in its experiments. It describes the model configuration (e.g., '16 attention heads, an embedding dimension of 2048') but not the datasets the models were trained on or evaluated against with explicit access details. |
| Dataset Splits | No | The paper states: 'All experiments use MoE transformer models configured with 16 attention heads, an embedding dimension of 2048, and an FFN intermediate size of 2048. We apply Distributed Data Parallelism (DDP) and Expert Parallelism for all experiments. We execute only the forward pass over a single MoE layer and measure the average runtime of 32 passes after 32 warmup passes. We use top-2 routing with a capacity factor of 1.0.' This describes the model and experimental execution details, but does not specify how any dataset (if used) was split into training, validation, or test sets. |
| Hardware Specification | Yes | When evaluated on an 8-H100 GPU node with MoE models... We run experiments on a server with 8 NVIDIA H100 80G GPUs interconnected via NVLink, 125 GB of RAM, and 20 vCPUs. |
| Software Dependencies | Yes | We used PyTorch 2.6.0, CUDA 12.8, and Ubuntu 22.04. |
| Experiment Setup | Yes | All experiments use MoE transformer models configured with 16 attention heads, an embedding dimension of 2048, and an FFN intermediate size of 2048. We apply Distributed Data Parallelism (DDP) and Expert Parallelism for all experiments. We execute only the forward pass over a single MoE layer and measure the average runtime of 32 passes after 32 warmup passes. We use top-2 routing with a capacity factor of 1.0. |
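
The measurement protocol quoted in the Experiment Setup row (average runtime of 32 passes after 32 warmup passes) can be sketched as a small timing harness. This is an illustrative sketch only, not the authors' benchmarking code: the function name `benchmark_forward`, the `forward` callable, and the optional `sync` hook are hypothetical names introduced here. On a GPU, one would presumably pass a synchronization call such as `torch.cuda.synchronize` as `sync` so that asynchronous kernels finish before the clock is read.

```python
import time
from statistics import mean


def benchmark_forward(forward, warmup_passes=32, timed_passes=32, sync=None):
    """Return the mean wall-clock time (seconds) of one call to `forward`.

    `forward` is a hypothetical zero-argument callable standing in for a
    single MoE-layer forward pass. `sync` is an optional zero-argument
    callable that blocks until outstanding work has finished (an assumption
    for asynchronous devices; it is not needed for plain CPU code).
    """
    # Warmup passes: run but do not time, so one-time costs are excluded.
    for _ in range(warmup_passes):
        forward()
    if sync is not None:
        sync()  # make sure warmup work is done before timing starts

    durations = []
    for _ in range(timed_passes):
        start = time.perf_counter()
        forward()
        if sync is not None:
            sync()  # wait for this pass to finish before stopping the clock
        durations.append(time.perf_counter() - start)

    return mean(durations)


def dummy_forward():
    # Placeholder workload used only to exercise the harness.
    return sum(i * i for i in range(10_000))


average_seconds = benchmark_forward(dummy_forward)
print(f"average forward time: {average_seconds * 1e3:.3f} ms")
```

The defaults mirror the 32/32 split reported in the table; both counts are parameters so the same harness can be reused with other settings.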