Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across four domains (general knowledge understanding, scientific question answering, mathematical reasoning, and code generation) demonstrate consistent performance improvements over existing methods. We evaluate FlyLoRA's performance across four key domains: (1) general knowledge understanding using the MMLU [25] benchmark with auxiliary training datasets for fine-tuning and test set for evaluation, (2) scientific question answering using the ScienceQA [48] dataset for fine-tuning and evaluation, (3) mathematical reasoning on GSM8K [12] problems for fine-tuning and evaluation, and (4) code generation assessed via CodeAlpaca-20k [7] for training and HumanEval [9] for evaluation.
Researcher Affiliation	Academia	(1) Department of Automation, Tsinghua University; (2) Academy of Medical Engineering and Translational Medicine, Tianjin University
Pseudocode	No	The paper describes methods and formulas but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code	Yes	Code is available at https://github.com/gfyddha/FlyLoRA.
Open Datasets	Yes	We evaluate FlyLoRA's performance across four key domains: (1) general knowledge understanding using the MMLU [25] benchmark with auxiliary training datasets for fine-tuning and test set for evaluation, (2) scientific question answering using the ScienceQA [48] dataset for fine-tuning and evaluation, (3) mathematical reasoning on GSM8K [12] problems for fine-tuning and evaluation, and (4) code generation assessed via CodeAlpaca-20k [7] for training and HumanEval [9] for evaluation.
Dataset Splits	Yes	Table 17: Details of MMLU, ScienceQA, GSM8K, CodeAlpaca and HumanEval Datasets. We list the number of training and testing samples and task types for the following datasets used in our experiments. Dataset (training samples / testing samples / task type): MMLU [25] (99,842 / 14,042 / Multiple Choice); ScienceQA [48] (12,726 / 4,241 / Multiple Choice); GSM8K [12] (7,473 / 1,319 / Math Problems); CodeAlpaca-20k [7] (20,022 / n/a / Code Instruction); HumanEval [9] (n/a / 164 / Code Generation).
Hardware Specification	Yes	Most experiments were conducted on a Linux server running Ubuntu 20.04.4 LTS, equipped with an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60GHz and 8 NVIDIA GeForce RTX 3090 GPUs, using CUDA version 11.7. Experiments with Qwen-2.5-14B were conducted on a machine with 8 NVIDIA A100 GPUs.
Software Dependencies	Yes	Most experiments were conducted on a Linux server running Ubuntu 20.04.4 LTS, equipped with an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60GHz and 8 NVIDIA GeForce RTX 3090 GPUs, using CUDA version 11.7.
Experiment Setup	Yes	Table 18: General Training Hyperparameters for FlyLoRA. Shared configuration across all experiments, including rank settings, optimizer details, and architectural choices. Parameter: Value. Total rank (r): 32; Scaling factor (α): 64; Activated rank: 8; Target modules: {q,k,v,o,gate,down,up}_proj; Optimizer: AdamW; Warmup ratio: 0.01; Gradient accumulated batch: 128; Dropout rate: 0.00. Table 19: Dataset-Specific and Model-Specific Training Configurations for FlyLoRA. Task-optimized settings for Llama-3.1-8B and Qwen-2.5-7B across four benchmarks, showing variations in epoch counts, learning rates, and sequence lengths based on dataset characteristics and model requirements.
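
For readers who want to reuse the shared settings quoted from Table 18, the values can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and both derived properties are hypothetical. In particular, the alpha / r scaling is the conventional LoRA formula and is assumed here, since the quoted text does not say how FlyLoRA applies the scaling factor.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FlyLoRAGeneralConfig:
    """Hypothetical container for the Table 18 values quoted above."""

    total_rank: int = 32            # Total rank (r)
    scaling_alpha: int = 64         # Scaling factor (alpha)
    activated_rank: int = 8         # Activated rank
    # Expansion of the shorthand {q,k,v,o,gate,down,up}_proj
    target_modules: Tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    )
    optimizer: str = "AdamW"
    warmup_ratio: float = 0.01
    grad_accumulated_batch: int = 128
    dropout: float = 0.0

    def __post_init__(self) -> None:
        # Sanity check: the activated rank cannot exceed the total rank.
        if not 0 < self.activated_rank <= self.total_rank:
            raise ValueError("activated_rank must be in (0, total_rank]")

    @property
    def active_fraction(self) -> float:
        """Share of the total rank that is activated (8 / 32 by default)."""
        return self.activated_rank / self.total_rank

    @property
    def lora_style_scale(self) -> float:
        """ASSUMPTION: conventional LoRA scaling alpha / r, not confirmed by the source."""
        return self.scaling_alpha / self.total_rank


cfg = FlyLoRAGeneralConfig()
print(cfg.active_fraction, cfg.lora_style_scale, len(cfg.target_modules))
```

The dataset-specific and model-specific values from Table 19 (epochs, learning rates, sequence lengths) are not reproduced in the quoted response, so they are deliberately left out of this sketch.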