Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning

Authors: Hua Ye, Siyuan Chen, Haoliang Zhang, Weihao Luo, Yanbin Li, Xuan Zhang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive empirical evaluations on various language understanding tasks show that our method consistently outperforms state-of-the-art baselines.
Researcher Affiliation Collaboration (1) Nanjing University; (2) Airon Technology CO., LTD; (3) University of Bristol; (4) The University of Oklahoma; (5) Donghua University; (6) Beijing University of Posts and Telecommunications; (7) Carnegie Mellon University
Pseudocode Yes Algorithm 1: Multi-Stage Adapter Tuning for LLMs. Require: Pretrained LLM parameters $\theta \in \mathbb{R}^p$; $k$ source domains $\{D_1, \ldots, D_k\}$; discrepancy measure $d(D_i, D_j)$; synergy measure $\mathrm{Synergy}(D_i, D_j)$; capacity cost $\mathrm{Cap}(\cdot)$; norm bounds $\rho_\theta$, $\rho_\phi$; number of stages $M$; (optional) mixing weights $\{\alpha_j^t\}$. Ensure: Final backbone parameters $\theta^M$; domain adapter parameters $\{\phi_j^M\}_{j=1}^k$.
Open Source Code No Due to institutional restrictions and proprietary considerations, the data and code used in this study are not publicly available at this time.
Open Datasets Yes Datasets. We evaluate our method on four representative multi-domain language understanding tasks: 1) News Summarization (NSum) Hermann et al. [2015]. ... 2) Sentiment Classification (Sent) Socher et al. [2013]. ... 3) Question Answering (Q&A) Rajpurkar [2016]. ... 4) Topic Categorization (Topic) Zhang et al. [2015].
Dataset Splits Yes We partition each dataset into training, validation, and test splits. Statistics (number of samples, average text length, etc.) are presented in Appendix A.1. Table 6: Summary of the multi-domain datasets used in our experiments.
Dataset | #Train | #Val | #Test | Metric
NSum (News Summ.) | 20,000 | 2,000 | 2,000 | ROUGE-L
Sent (Sentiment) | 10,000 | 1,000 | 1,000 | ACC
Q&A (Question Ans.) | 15,000 | 1,500 | 1,500 | EM / F1
Topic (Classification) | 12,000 | 1,200 | 1,200 | ACC
Hardware Specification Yes Hardware and Software. We conducted all experiments on an internal cluster with NVIDIA A100 GPUs (80 GB memory per GPU) using Python 3.9, PyTorch 2.0.0, and Hugging Face Transformers 4.30.2.
Software Dependencies Yes Hardware and Software. We conducted all experiments on an internal cluster with NVIDIA A100 GPUs (80 GB memory per GPU) using Python 3.9, PyTorch 2.0.0, and Hugging Face Transformers 4.30.2.
Experiment Setup Yes Training Configuration. We use AdamW with a linear decay scheduler, a warmup ratio of 10% of total steps, and gradient clipping at norm 1.0. Table 7 gives key hyperparameters. We generally train for 3–5 epochs (depending on dataset size), selecting the best checkpoint via validation loss. Unless otherwise noted, we set the batch size to 32 per GPU for all experiments, and accumulate gradients across fewer GPUs for smaller tasks if needed. We adopt the default mixed-precision (fp16) training in PyTorch. Table 7: Default hyperparameter values.
Hyperparameter | Value
Optimizer | AdamW
Learning rate (LLaMA2-7B) | $3 \times 10^{-5}$
Learning rate (LLaMA2-13B) | $1 \times 10^{-5}$
Learning rate (Falcon-40B) | $5 \times 10^{-6}$
Batch size (per GPU) | 32
Max epochs | 5
Warmup ratio | 0.1
Gradient clipping | 1.0
Precision | FP16
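
The Training Configuration quoted above pairs a 10% warmup with a linear decay scheduler. The sketch below illustrates one common reading of that schedule: the learning rate rises linearly from zero during warmup, then falls linearly to zero by the final step. The paper's code is not public, so the exact schedule shape is an assumption. The function name `linear_warmup_decay`, the dictionary `LEARNING_RATES`, and the 1,000-step run length are hypothetical and chosen only for illustration; the peak learning rates and warmup ratio are taken from Table 7.

```python
# Peak learning rates and warmup ratio as listed in Table 7.
# The dictionary and constant names are hypothetical, not from the paper.
LEARNING_RATES = {
    "LLaMA2-7B": 3e-5,
    "LLaMA2-13B": 1e-5,
    "Falcon-40B": 5e-6,
}
WARMUP_RATIO = 0.1


def linear_warmup_decay(step, total_steps, base_lr, warmup_ratio=WARMUP_RATIO):
    """Return the learning rate at `step` under an assumed schedule:
    linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step index.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down proportionally to the steps remaining.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 1000  # hypothetical run length, not a value from the paper
    for s in (0, 50, 100, 550, 1000):
        lr = linear_warmup_decay(s, total, LEARNING_RATES["LLaMA2-7B"])
        print(f"step {s:4d}: lr = {lr:.2e}")
```

In an actual PyTorch training loop the same shape is usually obtained from a library scheduler rather than a hand-written function, with gradient clipping at norm 1.0 applied before each optimizer step.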