Parameter Competition Balancing for Model Merging
Authors: Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Min Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. |
| Researcher Affiliation | Academia | ¹Harbin Institute of Technology, Shenzhen, China; ²Xiamen University Malaysia; ³Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1 PCB-Merging Procedure. Input: fine-tuned models {θ_i} (i = 1, …, n), initialization θ_pre, mask ratio r, and coefficient λ. Output: merged model θ_m |
| Open Source Code | Yes | The code is publicly available at: https://github.com/duguodong7/pcb-merging. |
| Open Datasets | Yes | CMMLU [38] is a comprehensive Chinese evaluation benchmark... GSM8K [10] is a collection of 8.5K high-quality, linguistically varied math word problems... HumanEval [6] is a dataset for evaluating code generation ability... MNIST [36] features grayscale images of handwritten digits across 10 classes. http://yann.lecun.com/exdb/mnist/ |
| Dataset Splits | Yes | Most model merging methods necessitate access to a validation set, utilized for computing the Fisher matrix or tuning hyperparameters... Tab. 4 presents the corresponding metrics on the validation set, showing consistent performance improvements with PCB-MERGING across all datasets. |
| Hardware Specification | Yes | Our experiments were conducted on Nvidia A6000 GPUs with 48GB of RAM. |
| Software Dependencies | No | The paper mentions software such as the AdamW optimizer and specific models/frameworks (T5, ViT, Llama2, PEFT, (IA)³, RoBERTa-base) but does not provide specific version numbers for any key software components or libraries. |
| Experiment Setup | Yes | We trained the T5-base and T5-large models for up to 75,000 steps, using an effective training batch size of 1024 and a learning rate of 0.0001. To prevent overfitting, we implemented an early stopping mechanism with a patience of 5. Training was conducted in bfloat16 to conserve GPU memory, with a maximum sequence length of 128 tokens. |
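The Algorithm 1 interface quoted in the Pseudocode row (fine-tuned models {θ_i}, initialization θ_pre, mask ratio r, coefficient λ → merged model θ_m) can be illustrated with a short PyTorch sketch. This is not the authors' implementation: the scoring step below is a simplified placeholder for the paper's intra-/inter-balancing terms, `pcb_merge_sketch` is a hypothetical name, and the exact masking convention may differ. The authoritative code is in the linked repository, and λ and r are typically tuned on the validation set mentioned in the Dataset Splits row.

```python
# Hedged sketch of a PCB-style merging loop (NOT the official implementation;
# see https://github.com/duguodong7/pcb-merging for the authors' code).
from typing import Dict, List
import torch


def pcb_merge_sketch(
    pretrained: Dict[str, torch.Tensor],        # θ_pre
    finetuned: List[Dict[str, torch.Tensor]],   # {θ_i} for i = 1..n
    mask_ratio: float = 0.1,                    # r: fraction of entries kept per task (assumed convention)
    lam: float = 1.0,                           # λ: merging coefficient
) -> Dict[str, torch.Tensor]:
    """Return a merged state dict θ_m = θ_pre + λ · Σ_i (β_i ⊙ τ_i) / Σ_i β_i."""
    merged = {}
    n = len(finetuned)
    for name, theta_pre in pretrained.items():
        # Task vectors τ_i = θ_i − θ_pre, stacked as (n, *param_shape).
        tau = torch.stack([ft[name] - theta_pre for ft in finetuned])

        # Placeholder competition score β_i: softmax across tasks of scaled τ_i².
        # The paper combines intra- and inter-balancing terms here instead.
        beta = torch.softmax(n * tau.pow(2), dim=0)

        # Drop low-scoring entries per task, keeping roughly the top-r fraction.
        k = max(1, int(mask_ratio * tau[0].numel()))
        flat_beta = beta.reshape(n, -1)
        thresh = flat_beta.topk(k, dim=1).values[:, -1:]
        thresh = thresh.reshape(n, *([1] * (tau.dim() - 1)))
        beta = beta * (beta >= thresh)

        # Score-weighted combination of the surviving task-vector entries.
        weighted = (beta * tau).sum(dim=0) / beta.sum(dim=0).clamp_min(1e-12)
        merged[name] = theta_pre + lam * weighted
    return merged
```

Because the scores both weight and normalize the surviving task vectors, an entry where only one task survives the mask reduces to that task's update scaled by λ, while contested entries are blended according to their relative scores.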
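For the fine-tuning setup in the Experiment Setup row, a minimal Hugging Face `transformers` configuration consistent with the reported hyperparameters might look as follows. The per-device batch size / gradient-accumulation split and the evaluation interval are assumptions, since the paper only reports the effective batch size of 1024; the 128-token limit is applied at tokenization time rather than in the training arguments.

```python
# Hedged sketch of training arguments matching the reported setup:
# 75k max steps, effective batch 1024, lr 1e-4, bfloat16, early stopping patience 5.
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="t5_finetune",
    max_steps=75_000,
    per_device_train_batch_size=32,   # 32 x 32 accumulation = effective 1024 (assumed split)
    gradient_accumulation_steps=32,
    learning_rate=1e-4,
    bf16=True,                        # train in bfloat16 to conserve GPU memory
    eval_strategy="steps",            # `evaluation_strategy` in older transformers versions
    save_strategy="steps",
    eval_steps=1_000,                 # evaluation interval is an assumption
    save_steps=1_000,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
# Inputs are tokenized with max_length=128 (truncation) before being passed to the trainer;
# the default optimizer is AdamW, matching the optimizer named in the paper.
```

The `load_best_model_at_end` and `metric_for_best_model` fields appear only because `EarlyStoppingCallback` requires them; the paper does not specify them.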