Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Advancing Expert Specialization for Better MoE

Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Xinye Cao, Sicong Leng, Qimei Cui, Xudong Jiang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we conduct experiments to address the following research questions: RQ1: Does introducing the orthogonality loss (L_o) and variance loss (L_v) lead to better overall performance in downstream tasks compared to baseline approaches? RQ2: To what extent does our method maintain expert load balancing during training? RQ3: How do the orthogonality loss (L_o) and variance loss (L_v) interact with each other, and what are their respective and joint impacts on expert specialization and routing behavior? RQ4: What are the individual and combined contributions of L_o, L_v, and the auxiliary loss L_aux to the final model performance?
Researcher Affiliation	Academia	(1) Beijing University of Posts and Telecommunications, China; (2) Nanyang Technological University, Singapore
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical formulas and prose.
Open Source Code	Yes	Our code is available at this link.
Open Datasets	Yes	Datasets. We evaluate our method on a total of 11 benchmarks. Specifically, we use the training sets from Numina [41], GLUE [66], and the FLAN collection [72] to train our models. Our benchmarks include: ❶ Mathematics: GSM8K [12], MATH500 [44], and Numina [41]; ❷ Multi-Domain Tasks: MMLU [31, 30], MMLU-Pro [70], BBH [63], GLUE [66], LiveBench [76], and GPQA [59]; ❸ Code generation: HumanEval [10] and MBPP [4]. We group training and test sets by language, reasoning, science, math, and code to match downstream evaluation needs. Details are given in Appendix D.
Dataset Splits	Yes	Setup. Each benchmark is fine-tuned separately on 6,000 high-quality examples, primarily from the official training split and supplemented when necessary. Answers are generated using strong teacher models (OpenAI o3-mini and DeepSeek R1) and manually verified for correctness. Fine-tuning is limited to three epochs (550 steps) to prevent overfitting.
Hardware Specification	Yes	Environment. All experiments are performed on a CentOS Linux 7 server with PyTorch 2.3. The hardware specifications consist of 240GB of RAM, a 16-core Intel Xeon CPU, and two NVIDIA A800 GPUs, each having 80GB of memory.
Software Dependencies	Yes	Environment. All experiments are performed on a CentOS Linux 7 server with PyTorch 2.3.
Experiment Setup	Yes	Setup. Each benchmark is fine-tuned separately on 6,000 high-quality examples, primarily from the official training split and supplemented when necessary. Answers are generated using strong teacher models (OpenAI o3-mini and DeepSeek R1) and manually verified for correctness. Fine-tuning is limited to three epochs (550 steps) to prevent overfitting. All experiments adopt LoRA-based fine-tuning, with LoRA modules inserted into both router and expert layers to enable joint optimization. A rank of 32 is used to approximate full-model updates. Detailed configurations, including optimizer, batch size, and learning rate, are provided in Appendix H.2.
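The figures quoted in the Experiment Setup row can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it derives the average number of examples consumed per optimizer step from the quoted 6,000 examples, three epochs, and 550 steps, and it counts the parameters a rank-32 LoRA adapter adds to a single weight matrix. The layer shape (2048 x 2048) and the helper names are hypothetical, and the actual batch size is reported in the paper's Appendix H.2, which is not reproduced in this record.

```python
# Figures quoted in the Experiment Setup row of this record.
NUM_EXAMPLES = 6_000   # fine-tuning examples per benchmark
NUM_EPOCHS = 3         # epoch cap
REPORTED_STEPS = 550   # step count quoted alongside the epoch cap
LORA_RANK = 32         # LoRA rank


def implied_batch_size(examples: int, epochs: int, steps: int) -> float:
    """Average number of examples consumed per optimizer step."""
    return examples * epochs / steps


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_out x d_in weight.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


batch = implied_batch_size(NUM_EXAMPLES, NUM_EPOCHS, REPORTED_STEPS)
print(f"implied examples per step: {batch:.1f}")

# Hypothetical layer shape, chosen only for illustration (not from the paper).
HYPOTHETICAL_D_IN, HYPOTHETICAL_D_OUT = 2048, 2048
extra = lora_extra_params(HYPOTHETICAL_D_IN, HYPOTHETICAL_D_OUT, LORA_RANK)
full = HYPOTHETICAL_D_IN * HYPOTHETICAL_D_OUT
print(f"LoRA adds {extra} parameters vs {full} in the full matrix")
```

Under these assumptions the implied figure is about 33 examples per step, which would be in line with an effective batch size near 32, though only Appendix H.2 of the paper can confirm the real setting.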