Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
Authors: Zongle Huang, Lei Zhu, ZongYuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University ²Huawei Noah's Ark Lab ³BNRist {huangzl23}@mails.tsinghua.edu.cn {ypliu}@tsinghua.edu.cn EMAIL |
| Pseudocode | Yes | Algorithm 1 The Modeling of SD Speedup and Corresponding Fitting Method |
| Open Source Code | Yes | We provide the complete codes and data as supplementary materials. |
| Open Datasets | Yes | Models are evaluated on HumanEval [43] and MT-bench [44] datasets for code generation and conversation tasks, following previous works [7, 45, 11]. |
| Dataset Splits | No | The paper mentions the HumanEval and MT-bench datasets but does not specify how they were split into training, validation, or test sets. It states that "tokenized prompt lengths range from 38 to 391 tokens for HumanEval and 5 to 356 tokens for MT-bench", but this describes input characteristics, not splits. |
| Hardware Specification | Yes | We conducted experiments on different hardware platforms including 2x A800, 2x H800, 4x A800, 4x L40. |
| Software Dependencies | No | We used the existing vllm [46] framework for our experiments to verify theoretical predictions. Vllm supports batched speculative decoding, cudagraph optimization, and reports comprehensive data such as T_D, T_T, T_reject and σ, thus being suitable for our experiments. |
| Experiment Setup | Yes | When we need to examine MoEs with different sparsity, we modify the num_experts_per_token in the model's config.json file. For comparison with dense models, we use Opt-30b and Opt-350m [42] as the target and draft models. Models are evaluated on HumanEval [43] and MT-bench [44] datasets for code generation and conversation tasks, following previous works [7, 45, 11]. The tokenized prompt lengths range from 38 to 391 tokens for HumanEval and 5 to 356 tokens for MT-bench. Frameworks and hardware. We used the existing vllm [46] framework for our experiments to verify theoretical predictions. Vllm supports batched speculative decoding, cudagraph optimization, and reports comprehensive data such as T_D, T_T, T_reject and σ, thus being suitable for our experiments. To prevent unstable performance at the beginning, all data were obtained by averaging the results from the last five of the total ten runs. |
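
The two setup steps quoted in the Experiment Setup row (changing the number of activated experts in `config.json`, and averaging only the last five of ten runs) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the helper names `set_experts_per_token` and `steady_state_mean` are hypothetical, and the config key defaults to `num_experts_per_token` as written in the quoted text. The actual key in a given checkpoint may be spelled differently, so it is left as a parameter.

```python
import json
from pathlib import Path
from statistics import mean


def set_experts_per_token(config_path, k, key="num_experts_per_token"):
    """Hypothetical helper: overwrite the activated-expert count in a config.json.

    `key` follows the name quoted from the paper; pass a different key if the
    checkpoint's config uses another spelling.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if key not in config:
        # Refuse to silently add a key the model would ignore.
        raise KeyError(f"{key!r} not found in {path}")
    config[key] = k
    path.write_text(json.dumps(config, indent=2))
    return config


def steady_state_mean(measurements, keep_last=5):
    """Hypothetical helper: average only the final `keep_last` runs.

    Earlier runs are treated as warm-up and discarded, mirroring the
    "last five of the total ten runs" protocol described above.
    """
    if len(measurements) < keep_last:
        raise ValueError("fewer measurements than keep_last")
    return mean(measurements[-keep_last:])
```

With ten timing measurements in a list, `steady_state_mean(timings)` returns the mean of the last five, so any warm-up instability in the first five runs does not affect the reported value.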