Parameter Competition Balancing for Model Merging

Authors: Guodong DU, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Min Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods.
Researcher Affiliation | Academia | (1) Harbin Institute of Technology, Shenzhen, China; (2) Xiamen University Malaysia; (3) Johns Hopkins University
Pseudocode | Yes | Algorithm 1: PCB-Merging Procedure. Input: fine-tuned models {θ_i}_{i=1}^{n}, initialization θ_pre, mask ratio r, and coefficient λ. Output: merged model θ_m. (A hedged code sketch of this procedure follows the table.)
Open Source Code | Yes | The code is publicly available at: https://github.com/duguodong7/pcb-merging.
Open Datasets | Yes | CMMLU [38] is a comprehensive Chinese evaluation benchmark... GSM8K [10] is a collection of 8.5K high-quality, linguistically varied math word problems... HumanEval [6] is a dataset for evaluating code generation ability... MNIST [36] features grayscale images of handwritten digits across 10 classes. http://yann.lecun.com/exdb/mnist/ (A hedged data-loading sketch follows the table.)
Dataset Splits | Yes | Most model merging methods necessitate access to a validation set, utilized for computing the Fisher matrix or tuning hyperparameters. Tab. 4 presents the corresponding metrics on the validation set, showing consistent performance improvements with PCB-Merging across all datasets.
Hardware Specification | Yes | Our experiments were conducted on Nvidia A6000 GPUs with 48GB of RAM.
Software Dependencies | No | The paper mentions software such as the AdamW optimizer and specific models and frameworks (T5, ViT, Llama2, PEFT, (IA)3, RoBERTa-base) but does not provide version numbers for any key software components or libraries.
Experiment Setup | Yes | We trained the T5-base and T5-large models for up to 75,000 steps, using an effective training batch size of 1024 and a learning rate of 0.0001. To prevent overfitting, we implemented an early stopping mechanism with a patience of 5. Training was conducted in bfloat16 to conserve GPU memory, with a maximum sequence length of 128 tokens. (A hedged configuration sketch follows the table.)
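
The Pseudocode entry above lists only the inputs and output of Algorithm 1. Below is a minimal sketch of a task-vector-style merging routine that takes those same inputs; the per-parameter competition score used here is a simple magnitude-based placeholder rather than the paper's intra/inter-balancing formulation, and the function and variable names (pcb_merge_sketch, scores, etc.) are our own assumptions.

```python
# Hypothetical sketch of a PCB-Merging-style procedure (not the paper's exact method).
# Inputs mirror Algorithm 1: fine-tuned parameters, pretrained theta_pre,
# mask ratio r, and scaling coefficient lam; output is the merged parameter vector.
import torch

def pcb_merge_sketch(finetuned: list[torch.Tensor], theta_pre: torch.Tensor,
                     r: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    # Task vectors: how each fine-tuned model moved away from the pretrained weights.
    task_vectors = torch.stack([theta - theta_pre for theta in finetuned])  # (n, d)

    # Placeholder "competition" score per parameter; the paper balances intra- and
    # inter-task competition, whereas this uses normalized squared magnitudes only.
    sq = task_vectors.pow(2)
    scores = torch.softmax(sq / sq.mean(dim=1, keepdim=True), dim=0)

    # Keep only the top (1 - r) fraction of parameters per task, zeroing the rest.
    k = int((1.0 - r) * task_vectors.shape[1])
    thresh = torch.topk(scores, k, dim=1).values[:, -1:]  # per-task score threshold
    masked_scores = scores * (scores >= thresh).float()

    # Score-weighted average of the surviving task-vector entries, scaled by lambda.
    weights = masked_scores / masked_scores.sum(dim=0).clamp_min(1e-12)
    merged_delta = (weights * task_vectors).sum(dim=0)
    return theta_pre + lam * merged_delta
```

For real models the fine-tuned checkpoints would be flattened state dicts; the sketch operates on flat parameter vectors to keep the indexing simple.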
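
The Open Datasets entry points to publicly released benchmarks. The snippet below sketches how a few of them could be pulled from the Hugging Face Hub; the use of the datasets library and the exact Hub identifiers are assumptions on our part, not something the assessment or the paper specifies.

```python
# Hypothetical retrieval of some benchmarks listed above via the Hugging Face Hub;
# the dataset identifiers are assumptions and may differ from the authors' sources.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")         # 8.5K math word problems
humaneval = load_dataset("openai_humaneval")  # code-generation evaluation set
mnist = load_dataset("mnist")                 # 10-class handwritten-digit images

print(gsm8k["train"][0]["question"])
# CMMLU is distributed per subject on the Hub, so it would need a subject config.
```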
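
The Experiment Setup entry reports concrete fine-tuning hyperparameters (75,000 steps, effective batch size 1024, learning rate 1e-4, early-stopping patience of 5, bfloat16, 128-token sequences). The sketch below arranges those numbers into a Hugging Face Trainer-style configuration; the framework choice, the per-device/accumulation split of the 1024 batch size, and all argument values beyond the reported numbers are assumptions.

```python
# Hypothetical Trainer-style configuration mirroring the reported hyperparameters;
# the paper/assessment does not state which training framework was actually used.
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="t5_finetune",              # assumed output path
    max_steps=75_000,                      # "up to 75,000 steps"
    per_device_train_batch_size=128,       # assumed split; effective batch size 1024
    gradient_accumulation_steps=8,         # 128 * 8 = 1024
    learning_rate=1e-4,
    bf16=True,                             # bfloat16 training to conserve GPU memory
    evaluation_strategy="steps",           # named `eval_strategy` in newer versions
    eval_steps=1_000,                      # assumed evaluation interval
    save_strategy="steps",
    save_steps=1_000,
    load_best_model_at_end=True,           # required for early stopping
)

# Early stopping after 5 evaluations without improvement, matching "patience of 5".
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)

# The 128-token limit would be applied at tokenization time, e.g.:
# tokenizer(examples["input"], max_length=128, truncation=True, padding="max_length")
```

Early stopping in this setup requires periodic evaluation and load_best_model_at_end=True, which is why those flags appear even though the quoted setup does not mention them.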