Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across four domains (general knowledge understanding, scientific question answering, mathematical reasoning, and code generation) demonstrate consistent performance improvements over existing methods. We evaluate FlyLoRA's performance across four key domains: (1) general knowledge understanding using the MMLU [25] benchmark with auxiliary training datasets for fine-tuning and test set for evaluation, (2) scientific question answering using the ScienceQA [48] dataset for fine-tuning and evaluation, (3) mathematical reasoning on GSM8K [12] problems for fine-tuning and evaluation, and (4) code generation assessed via CodeAlpaca-20k [7] for training and HumanEval [9] for evaluation. |
| Researcher Affiliation | Academia | 1Department of Automation, Tsinghua University 2Academy of Medical Engineering and Translational Medicine, Tianjin University |
| Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code is available at https://github.com/gfyddha/FlyLoRA. |
| Open Datasets | Yes | We evaluate FlyLoRA's performance across four key domains: (1) general knowledge understanding using the MMLU [25] benchmark with auxiliary training datasets for fine-tuning and test set for evaluation, (2) scientific question answering using the ScienceQA [48] dataset for fine-tuning and evaluation, (3) mathematical reasoning on GSM8K [12] problems for fine-tuning and evaluation, and (4) code generation assessed via CodeAlpaca-20k [7] for training and HumanEval [9] for evaluation. |
| Dataset Splits | Yes | Table 17: Details of MMLU, ScienceQA, GSM8K, CodeAlpaca and HumanEval Datasets. We list the number of training and testing samples and task types for the following datasets used in our experiments. Dataset (training samples / testing samples / task type): MMLU [25] (99,842 / 14,042 / Multiple Choice); ScienceQA [48] (12,726 / 4,241 / Multiple Choice); GSM8K [12] (7,473 / 1,319 / Math Problems); CodeAlpaca-20k [7] (20,022 / none listed / Code Instruction); HumanEval [9] (none listed / 164 / Code Generation). |
| Hardware Specification | Yes | Most experiments were conducted on a Linux server running Ubuntu 20.04.4 LTS, equipped with an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60GHz and 8 NVIDIA GeForce RTX 3090 GPUs, using CUDA version 11.7. Experiments with Qwen-2.5-14B were conducted on a machine with 8 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | Most experiments were conducted on a Linux server running Ubuntu 20.04.4 LTS, equipped with an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60GHz and 8 NVIDIA GeForce RTX 3090 GPUs, using CUDA version 11.7. |
| Experiment Setup | Yes | Table 18: General Training Hyperparameters for FlyLoRA. Shared configuration across all experiments, including rank settings, optimizer details, and architectural choices. Total rank (r): 32; Scaling factor (α): 64; Activated rank: 8; Target modules: {q,k,v,o,gate,down,up}_proj; Optimizer: AdamW; Warmup ratio: 0.01; Gradient accumulated batch: 128; Dropout rate: 0.00. Table 19: Dataset-Specific and Model-Specific Training Configurations for FlyLoRA. Task-optimized settings for Llama-3.1-8B and Qwen-2.5-7B across four benchmarks, showing variations in epoch counts, learning rates, and sequence lengths based on dataset characteristics and model requirements. |
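
For readers who want the shared hyperparameters quoted from Table 18 in machine-readable form, the following is a minimal sketch, not the authors' code. The class name `FlyLoRAConfig`, its field names, and both helper methods are hypothetical and do not come from the paper or its repository. Reading "Gradient accumulated batch 128" as the effective batch size after gradient accumulation is an assumption, as is the use of the conventional LoRA scale α/r, which this summary does not confirm for FlyLoRA.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FlyLoRAConfig:
    """Hypothetical container for the Table 18 values (not the authors' API)."""

    total_rank: int = 32            # Total rank (r)
    scaling_alpha: int = 64         # Scaling factor (alpha)
    activated_rank: int = 8         # Activated rank
    # Expansion of the quoted pattern {q,k,v,o,gate,down,up}_proj
    target_modules: Tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    )
    optimizer: str = "AdamW"
    warmup_ratio: float = 0.01
    # Assumption: "Gradient accumulated batch 128" is the effective batch size.
    effective_batch_size: int = 128
    dropout: float = 0.0

    def activated_fraction(self) -> float:
        """Share of the total rank that is activated."""
        return self.activated_rank / self.total_rank

    def conventional_lora_scale(self) -> float:
        """alpha / r as in standard LoRA; assumed, not confirmed for FlyLoRA."""
        return self.scaling_alpha / self.total_rank


cfg = FlyLoRAConfig()
print(cfg.activated_fraction())
print(cfg.conventional_lora_scale())
print(len(cfg.target_modules))
```

The dataset-specific settings from Table 19 (epoch counts, learning rates, sequence lengths) are not reproduced in this summary, so they are deliberately left out of the sketch.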