Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Authors: Zongle Huang, Lei Zhu, ZongYuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University ²Huawei Noah's Ark Lab ³BNRist {huangzl23}@mails.tsinghua.edu.cn {ypliu}@tsinghua.edu.cn EMAIL |
| Pseudocode | Yes | Algorithm 1 The Modeling of SD Speedup and Corresponding Fitting Method |
| Open Source Code | Yes | We provide the complete codes and data as supplementary materials. |
| Open Datasets | Yes | Models are evaluated on HumanEval [43] and MT-bench [44] datasets for code generation and conversation tasks, following previous works [7, 45, 11]. |
| Dataset Splits | No | The paper mentions the HumanEval and MT-bench datasets but does not specify how they were split into training, validation, or test sets. It states that "tokenized prompt lengths range from 38 to 391 tokens for HumanEval and 5 to 356 tokens for MT-bench", but this describes input characteristics, not splits. |
| Hardware Specification | Yes | We conducted experiments on different hardware platforms including 2x A800, 2x H800, 4x A800, 4x L40. |
| Software Dependencies | No | We used the existing vllm [46] framework for our experiments to verify theoretical predictions. Vllm supports batched speculative decoding, cudagraph optimization, and reports comprehensive data such as T_D, T_T, T_reject and σ, thus being suitable for our experiments. |
| Experiment Setup | Yes | When we need to examine MoEs with different sparsity, we modify the num_experts_per_token in the model's config.json file. For comparison with dense models, we use Opt-30b and Opt-350m [42] as the target and draft models. Models are evaluated on HumanEval [43] and MT-bench [44] datasets for code generation and conversation tasks, following previous works [7, 45, 11]. The tokenized prompt lengths range from 38 to 391 tokens for HumanEval and 5 to 356 tokens for MT-bench. Frameworks and hardware. We used the existing vllm [46] framework for our experiments to verify theoretical predictions. Vllm supports batched speculative decoding, cudagraph optimization, and reports comprehensive data such as T_D, T_T, T_reject and σ, thus being suitable for our experiments. To prevent unstable performance at the beginning, all data were obtained by averaging the results from the last five of the total ten runs. |
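
The Experiment Setup entry describes two mechanical steps: changing the number of active experts in the model's config.json, and averaging the last five of ten runs to exclude warm-up effects. The following is a minimal sketch of both, not the authors' code. The function names `set_experts_per_token` and `steady_state_mean` are hypothetical, and the config key is left as a parameter because the entry names it `num_experts_per_token` while released checkpoints may spell the field differently (for example `num_experts_per_tok`).

```python
import json
from pathlib import Path
from statistics import mean


def set_experts_per_token(config_path, k, key="num_experts_per_token"):
    """Set the number of active experts in a model config.json.

    `key` defaults to the field name quoted in the report; it is an
    assumption and should be checked against the actual checkpoint.
    Returns the previous value (None if the key was absent).
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get(key)
    config[key] = k  # only this field changes; all others are preserved
    path.write_text(json.dumps(config, indent=2))
    return previous


def steady_state_mean(measurements, total_runs=10, keep_last=5):
    """Average the last `keep_last` of `total_runs` measurements.

    Mirrors the report's description: ten runs in total, with the first
    five discarded as potentially unstable warm-up runs.
    """
    if len(measurements) != total_runs:
        raise ValueError(f"expected {total_runs} measurements, got {len(measurements)}")
    return mean(measurements[-keep_last:])
```

Calling `set_experts_per_token("path/to/config.json", 4)` before launching a run, then passing the ten per-run timings to `steady_state_mean`, would reproduce the procedure as described; the path and the value 4 are placeholders.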