Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning

Authors: Hanqing Zeng, Yinglong Xia, Zhuokai Zhao, Chuan Jiang, Qiang Zhang, Jiayi Liu, Qunshu Zhang, Lizhu Zhang, Xiangjun Fan, Benyu Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation. We extensively evaluate S'MoRE on 3 base models (LLaMA 3.2 1B, LLaMA 3 8B and Gemma 2 9B), 7 fine-tuning benchmarks, 3 types of router gates, and across different scales.
Researcher Affiliation	Industry	Hanqing Zeng Meta AI EMAIL Yinglong Xia Meta AI EMAIL Zhuokai Zhao Meta AI EMAIL Chuan Jiang Meta AI EMAIL Qiang Zhang Meta AI EMAIL Jiayi Liu Meta AI EMAIL Qunshu Zhang Meta AI EMAIL Lizhu Zhang Meta AI EMAIL Xiangjun Fan Meta AI EMAIL Benyu Zhang Meta AI EMAIL
Pseudocode	No	The paper describes the model design and routing process in sections 3.2 and 3.3, and details three types of gates in Appendix B.1, but does not provide a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Our implementation is available at: https://github.com/ZimpleX/SMoRE-LLM.
Open Datasets	Yes	We fine-tune on a diverse set of benchmarks, including ARC-c/e [Clark et al., 2018], CommonsenseQA (CSQA) [Talmor et al., 2018], OpenBookQA (OBQA) [Mihaylov et al., 2018], Winogrande [Sakaguchi et al., 2021], GSM8K [Cobbe et al., 2021], and HumanEval [Chen et al., 2021].
Dataset Splits	Yes	For HumanEval, we follow Tian et al. [2024] to train the base LLM on CodeAlpaca [Chaudhary, 2023], and evaluate Pass@1 on HumanEval. For all other datasets, we fine-tune on the training split and evaluate Accuracy on the test split.
Hardware Specification	Yes	For the computation hardware, all experiments are run on a single node with 4 NVIDIA A100 80GB GPUs.
Software Dependencies	No	We implement S'MoRE by adding a customized adapter to the Hugging Face PEFT library [Mangrulkar et al., 2022]. All models are trained via the LLaMA-Factory [Zheng et al., 2024] SFT pipeline, ensuring a consistent execution environment. Similarly, all the evaluations are conducted through OpenCompass [Contributors, 2023b], which is a unified evaluation framework providing a standard API for all considered benchmarks.
Experiment Setup	Yes	For hyperparameter tuning, we train all models using the same number of epochs, learning rate schedule, gradient accumulation steps and machine type. All models are trained under the LLaMA-Factory [Zheng et al., 2024] framework and evaluated with OpenCompass [Contributors, 2023b]. For hyperparameter search, we set an equal budget of trainable parameters, and vary the expert rank, the number of experts, the number of activated experts, etc. See Appendix D.2 for details of the hyperparameter range, and the hardware / software configuration. All baselines and S'MoRE are trained with 2 epochs, with learning rate 1e-4. The learning rate follows a cosine schedule.
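
The Experiment Setup entry quotes a learning rate of 1e-4 with a cosine schedule over 2 epochs. As a rough illustration only, the sketch below computes a cosine-annealed learning rate in plain Python. The quoted text does not state a warmup phase, a minimum learning rate, or the number of optimizer steps, so the function name `cosine_lr`, the `min_lr` default of 0, and the `total_steps` argument are assumptions made for this example and are not taken from the paper or from LLaMA-Factory.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`.

    Assumptions (not stated in the quoted setup): no warmup, and decay
    from `peak_lr` down to `min_lr` over the whole run.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Half-cosine decay from peak_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at the peak value, reaches half of it midway through training, and decays to `min_lr` at the final step. The real step count depends on dataset size, batch size, and gradient accumulation, none of which are given in the quoted text.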