Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

Authors: Sikai Bai, Haoxi Li, Jie ZHANG, Zicong Hong, Song Guo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of original performance on Mixtral 8×7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.
Researcher Affiliation	Academia	Sikai Bai HKUST Hong Kong, China EMAIL Haoxi Li HKUST Hong Kong, China EMAIL Jie Zhang HKUST Hong Kong, China EMAIL Zicong Hong HKUST Hong Kong, China EMAIL Song Guo HKUST Hong Kong, China EMAIL
Pseudocode	Yes	Algorithm 1: DiEP Differentiable Expert Pruning
Open Source Code	No	Justification: We will release all codes after our paper is accepted.
Open Datasets	Yes	We evaluate model performance using the Language Model Evaluation Harness library [13] across four zero-shot tasks: MMLU [18], OpenBookQA [34], BoolQ [8], and RTE [4]. ... During the expert pruning phase, we construct a small calibration subset with 128 samples from the C4 dataset for fine-tuning purposes.
Dataset Splits	Yes	We evaluate model performance using the Language Model Evaluation Harness library [13] across four zero-shot tasks: MMLU [18], OpenBookQA [34], BoolQ [8], and RTE [4]. ... During the expert pruning phase, we construct a small calibration subset with 128 samples from the C4 dataset for fine-tuning purposes.
Hardware Specification	Yes	All experimental evaluations are conducted using four NVIDIA GeForce A800 GPUs.
Software Dependencies	No	The paper mentions using the 'Language Model Evaluation Harness library' but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries used in implementation.
Experiment Setup	Yes	During the expert pruning phase, we construct a small calibration subset with 128 samples from the C4 dataset for fine-tuning purposes. We implement parameter-efficient differential learning through alternating training cycles, with a 3:1 ratio between intra-layer scores α and inter-layer scores β updates. Both training processes employ a learning rate of 5e-3 with a cosine learning rate scheduler. In addition, the complete training protocol consists of 10 epochs with a batch size of 16. For weight hyperparameter settings, we use λ = 0.01 for all Mixtral architectures and λ = 0.01 for other MoE models.
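
The quoted experiment setup can be read as a concrete update schedule. The sketch below is only an illustration of that reading, not the authors' implementation (their code was not released at the time of indexing). It assumes that the 3:1 ratio applies per optimizer step (three α updates, then one β update), that the cosine schedule decays to zero over the whole run, and that both score groups share one learning-rate curve. The names `cosine_lr` and `update_schedule` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
CALIBRATION_SAMPLES = 128   # C4 calibration subset
BATCH_SIZE = 16
EPOCHS = 10
BASE_LR = 5e-3

# Assumption: the 3:1 ratio is applied per optimizer step.
ALPHA_STEPS, BETA_STEPS = 3, 1


def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine decay from base_lr toward 0 (assumed: no warmup, floor of 0)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def update_schedule():
    """Return (step, target, lr) tuples for the alternating alpha/beta updates."""
    steps_per_epoch = CALIBRATION_SAMPLES // BATCH_SIZE
    total_steps = steps_per_epoch * EPOCHS
    cycle = ALPHA_STEPS + BETA_STEPS
    schedule = []
    for step in range(total_steps):
        # First three steps of each cycle update alpha, the fourth updates beta.
        target = "alpha" if step % cycle < ALPHA_STEPS else "beta"
        schedule.append((step, target, cosine_lr(step, total_steps)))
    return schedule


if __name__ == "__main__":
    for step, target, lr in update_schedule()[:8]:
        print(f"step {step}: update {target} scores, lr={lr:.6f}")
```

Under these assumptions, 128 samples at batch size 16 give 8 steps per epoch and 80 steps in total. If the paper instead alternates per epoch, or uses separate schedulers for α and β, the cycle logic would need to change accordingly.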