Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning

Authors: Hua Ye, Siyuan Chen, Haoliang Zhang, Weihao Luo, Yanbin Li, Xuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive empirical evaluations on various language understanding tasks show that our method consistently outperforms state-of-the-art baselines.
Researcher Affiliation	Collaboration	(1) Nanjing University; (2) Airon Technology CO., LTD; (3) University of Bristol; (4) The University of Oklahoma; (5) Donghua University; (6) Beijing University of Posts and Telecommunications; (7) Carnegie Mellon University
Pseudocode	Yes	Algorithm 1: Multi-Stage Adapter Tuning for LLMs. Require: Pretrained LLM parameters $\theta \in \mathbb{R}^p$; $k$ source domains $\{D_1, \dots, D_k\}$; discrepancy measure $d(D_i, D_j)$; synergy measure $\mathrm{Synergy}(D_i, D_j)$; capacity cost $\mathrm{Cap}(\cdot)$; norm bounds $\rho_\theta, \rho_\phi$; number of stages $M$; (optional) mixing weights $\{\alpha_j^t\}$. Ensure: Final backbone parameters $\theta^M$; domain adapter parameters $\{\phi_j^M\}_{j=1}^{k}$.
Open Source Code	No	Due to institutional restrictions and proprietary considerations, the data and code used in this study are not publicly available at this time.
Open Datasets	Yes	Datasets. We evaluate our method on four representative multi-domain language understanding tasks: 1) News Summarization (NSum) Hermann et al. [2015]. ... 2) Sentiment Classification (Sent) Socher et al. [2013]. ... 3) Question Answering (Q&A) Rajpurkar [2016]. ... 4) Topic Categorization (Topic) Zhang et al. [2015].
Dataset Splits	Yes	We partition each dataset into training, validation, and test splits. Statistics (number of samples, average text length, etc.) are presented in Appendix A.1. Table 6: Summary of the multi-domain datasets used in our experiments (Dataset: #Train / #Val / #Test, Metric). NSum (News Summ.): 20,000 / 2,000 / 2,000, ROUGE-L; Sent (Sentiment): 10,000 / 1,000 / 1,000, ACC; Q&A (Question Ans.): 15,000 / 1,500 / 1,500, EM / F1; Topic (Classification): 12,000 / 1,200 / 1,200, ACC.
Hardware Specification	Yes	Hardware and Software. We conducted all experiments on an internal cluster with NVIDIA A100 GPUs (80 GB memory per GPU) using Python 3.9, PyTorch 2.0.0, and Hugging Face Transformers 4.30.2.
Software Dependencies	Yes	Hardware and Software. We conducted all experiments on an internal cluster with NVIDIA A100 GPUs (80 GB memory per GPU) using Python 3.9, PyTorch 2.0.0, and Hugging Face Transformers 4.30.2.
Experiment Setup	Yes	Training Configuration. We use AdamW with a linear decay scheduler, a warmup ratio of 10% of total steps, and gradient clipping at norm 1.0. Table 7 gives key hyperparameters. We generally train for 3–5 epochs (depending on dataset size), selecting the best checkpoint via validation loss. Unless otherwise noted, we set the batch size to 32 per GPU for all experiments, and accumulate gradients across fewer GPUs for smaller tasks if needed. We adopt the default mixed-precision (fp16) training in PyTorch. Table 7: Default hyperparameter values (Hyperparameter: Value). Optimizer: AdamW; Learning rate (LLaMA2-7B): 3 × 10^-5; Learning rate (LLaMA2-13B): 1 × 10^-5; Learning rate (Falcon-40B): 5 × 10^-6; Batch size (per GPU): 32; Max epochs: 5; Warmup ratio: 0.1; Gradient clipping: 1.0; Precision: FP16.
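
The learning-rate schedule quoted in the Experiment Setup row (linear decay with a warmup ratio of 0.1) can be illustrated in plain Python. This is a minimal sketch, not the authors' code: it assumes the warmup is linear from zero and that the rate decays linearly to zero at the final step, neither of which the quoted text states. The function name lr_at_step and the total step count of 1000 are hypothetical; only the base learning rates and the warmup ratio come from Table 7.

```python
def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by linear decay.

    Assumption: warmup starts at 0 and decay ends at 0; the quoted setup
    only says "linear decay scheduler" with a 10% warmup ratio.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Warmup phase: scale rises linearly from 0 to 1.
        scale = step / max(1, warmup_steps)
    else:
        # Decay phase: scale falls linearly from 1 to 0 at total_steps.
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * scale


# Base learning rates as reported in Table 7 of the quoted setup.
BASE_LR = {"LLaMA2-7B": 3e-5, "LLaMA2-13B": 1e-5, "Falcon-40B": 5e-6}

if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at_step(step, total, BASE_LR["LLaMA2-7B"]))
```

Under these assumptions the rate peaks at the base value once warmup ends (step 100 of 1000 here) and is halved midway through the decay phase.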