Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

Authors: Zican Dong, Han Peng, Peiyu Liu, Xin Zhao, Dong Wu, Feng Xiao, Zhifeng Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method can achieve comparable performances and 2.99× throughput under the same memory budget as the full model, with only half the experts. The paper includes evaluation benchmarks, experiment settings, main results, ablation studies, and performance tables (Tables 2, 3, 4, 5, 6, 9, 10, 11), along with a figure illustrating Throughput vs Performance (Figure 1), indicating empirical validation.
Researcher Affiliation	Collaboration	The authors are affiliated with 'Gaoling School of Artificial Intelligence, Renmin University of China' and 'University of International Business and Economics' (academic institutions), as well as 'Yan Tron Technology Co. Ltd' and 'EBTech Co. Ltd' (industry organizations), indicating a collaboration between academia and industry.
Pseudocode	No	The paper describes the methodology using prose and mathematical equations in Section 4.
Open Source Code	Yes	Our code is available at https://github.com/RUCAIBox/EASYEP.
Open Datasets	Yes	To systematically assess the effectiveness of our proposed method, we conduct experiments across eight benchmark datasets: AIME-2024, AIME-2025, HMMT-Feb 2025, LiveCodeBench [19], GPQA-Diamond [18], USMLE [22], FinanceIQ [23], and AgentBench-OS [24]. Table 8: Calibration Set of Each Domain lists specific datasets with licenses, such as AIME 2023, LiveCodeBench-V3, GPQA-Main, Agent Dev Set of AgentBench-OS, Finance Dev Set of FinanceIQ, Medical Dev Set of USMLE, along with their respective citations.
Dataset Splits	Yes	For each domain, we randomly sample 25 instances and construct a calibration set by concatenating their inputs with the target model's outputs. For the evaluation of FinanceIQ, we randomly select 1000 samples with the seed of 42 since the test set is too large.
Hardware Specification	Yes	necessitating 4×8 A800 or 2×8 H800 GPU configurations, respectively. Additionally, all the experiments are conducted in one 8×H200 GPU. We deploy DeepSeek-R1 with two 8×H800 for 224 and 256 experts, while one 8×H800 for others.
Software Dependencies	No	To evaluate the throughput of pruned models with different numbers of experts, we use the SGLang [26] package and measure performance under a maximum request concurrency of 32. While the SGLang package is mentioned, no specific version number is provided for it or any other software component used in the experiments.
Experiment Setup	Yes	We set the maximum context length to 32K, the temperature to 0.6, and the top-p sampling value to 0.95 for most benchmarks (temperature as 0.2 for LiveCodeBench). We then evaluate the expert scores on the calibration data and select the top 64 and 128 experts with the highest scores at each layer, respectively. To ensure statistical reliability, most benchmarks are evaluated independently 5 times (32 times for math benchmarks), and we report the average performance of pass@1.
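
The Dataset Splits and Experiment Setup rows describe three mechanical steps: a seeded subsample of 1000 evaluation items, per-layer selection of the highest-scoring experts, and pass@1 averaged over independent runs. The following is a minimal sketch of those steps, not the authors' implementation (their code is at the repository linked above). All function names are hypothetical. The expert scores are assumed to be given as plain lists of numbers; how the paper computes them is not shown here. Python's `random` module is assumed for sampling, so the exact subset will differ from the paper's if another generator was used.

```python
import random
from statistics import mean


def sample_eval_subset(items, n=1000, seed=42):
    """Draw a reproducible subset of n items (seed 42, as stated in the table).

    Assumption: Python's random.Random is the generator; the paper does not
    say which library was used.
    """
    rng = random.Random(seed)
    return rng.sample(items, n)


def select_top_experts(scores_per_layer, k):
    """Return, for each layer, the indices of the k highest-scoring experts.

    scores_per_layer is a hypothetical input: one list of scores per layer,
    with one score per expert. Indices are returned in ascending order.
    """
    kept = []
    for scores in scores_per_layer:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept.append(sorted(ranked[:k]))
    return kept


def average_pass_at_1(runs):
    """Average pass@1 over independent runs.

    runs is a list of runs; each run is a list of booleans, one per problem,
    marking whether the single sampled answer was correct.
    """
    return mean(sum(run) / len(run) for run in runs)
```

With 256 experts per layer, `select_top_experts(scores, 128)` corresponds to the "half the experts" setting, and `average_pass_at_1` would receive 5 runs for most benchmarks or 32 runs for the math benchmarks.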