Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

Authors: Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, Yu Cheng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 20 datasets for both language and vision tasks demonstrate the effectiveness of our method, showing an average improvement of 28.34% in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks.
Researcher Affiliation | Collaboration | 1) School of Computer Science & Technology, Huazhong University of Science and Technology; 2) Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL); 3) Ping An Property & Casualty Insurance Company of China, Ltd.; 4) The Chinese University of Hong Kong.
Pseudocode | Yes | Algorithm 1: Twin-Merging (an illustrative sketch of the two-stage procedure follows the table).
Open Source Code | Yes | Our implementation is available at https://github.com/LZY-the-boys/Twin-Merging
Open Datasets | Yes | For language discriminative tasks, following [76, 79], we use RoBERTa [42] as the backbone and evaluate on the 8-task GLUE benchmark [69]... QNLI, CoLA, and STS-B are licensed under CC-BY-SA. QQP is licensed under MIT. SST-2 and MRPC are licensed under Apache 2.0. MNLI is licensed under OANC. RTE is licensed under CC BY 4.0. Thus, these GLUE datasets are available for non-commercial research purposes.
Dataset Splits | Yes | We split 10% of the training set as a validation set and employ the original validation data as the test set (see the data-loading sketch after the table).
Hardware Specification | Yes | We executed all our experiments on Nvidia A100 GPUs equipped with 80GB RAM.
Software Dependencies | No | The paper mentions frameworks and models like RoBERTa, Qwen-14B, and LoRA but does not specify version numbers for general software dependencies or libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Our selected hyperparameters included a batch size of 64 and a learning rate set at 1e-5. For generative tasks, the fine-tuning process for Qwen-14B involved the use of LoRA with a rank of 32, a batch size of 128, and a learning rate of 2e-4 for 3 epochs (restated in the configuration sketch after the table).
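
The paper's Algorithm 1 describes a two-stage procedure: knowledge is first modularized into a shared expert plus compressed task-exclusive differences, which are then recombined dynamically at test time. The following is a minimal sketch of that idea, assuming PyTorch state dicts; the plain parameter average standing in for the shared-expert merge, the truncation rank, and all function names are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the Twin-Merging idea:
# (1) build a shared expert from the task-specific fine-tuned weights,
# (2) keep each task's "exclusive" knowledge as an SVD-compressed difference
#     from the shared expert,
# (3) at test time, combine shared + exclusive knowledge with router weights.
import torch


def build_shared_expert(finetuned: list[dict]) -> dict:
    """Average the fine-tuned checkpoints to obtain the shared expert.
    (A simple mean stands in for whatever merging scheme is actually used.)"""
    shared = {}
    for key in finetuned[0]:
        shared[key] = torch.stack([sd[key] for sd in finetuned]).mean(dim=0)
    return shared


def build_exclusive_expert(task_sd: dict, shared: dict, rank: int = 32) -> dict:
    """Task-exclusive knowledge = fine-tuned - shared, compressed by truncated SVD."""
    exclusive = {}
    for key, weight in task_sd.items():
        delta = weight - shared[key]
        if delta.ndim == 2:  # compress 2-D weight matrices only
            u, s, vh = torch.linalg.svd(delta, full_matrices=False)
            k = min(rank, s.numel())
            delta = (u[:, :k] * s[:k]) @ vh[:k]
        exclusive[key] = delta
    return exclusive


def dynamic_merge(shared: dict, exclusives: list[dict], router_weights: torch.Tensor) -> dict:
    """Input-conditioned merge: theta(x) = shared + sum_t w_t(x) * exclusive_t."""
    merged = {}
    for key in shared:
        merged[key] = shared[key] + sum(
            w * expert[key] for w, expert in zip(router_weights.tolist(), exclusives)
        )
    return merged
```

At inference time, `dynamic_merge` would be invoked per input (or per batch) with router weights produced by a small router trained on input representations, which is what makes the merge dynamic rather than a single static checkpoint.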
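
The data protocol quoted in the Open Datasets and Dataset Splits rows (8 GLUE tasks, 10% of the training set held out for validation, original validation split used as the test set) can be expressed with the Hugging Face `datasets` library. This is an illustrative sketch, not the authors' released preprocessing; the fixed seed and the choice of MNLI's matched validation split are assumptions.

```python
# Illustrative reconstruction of the split protocol described above.
from datasets import load_dataset

glue_tasks = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]

splits = {}
for task in glue_tasks:
    raw = load_dataset("glue", task)
    held_out = raw["train"].train_test_split(test_size=0.1, seed=42)
    splits[task] = {
        "train": held_out["train"],        # 90% of the original training data
        "validation": held_out["test"],    # 10% held out for model selection
        # GLUE test labels are hidden, so the original validation split serves as the test set.
        "test": raw["validation_matched"] if task == "mnli" else raw["validation"],
    }
```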
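
The hyperparameters quoted in the Experiment Setup row translate naturally into Hugging Face Transformers/PEFT configuration objects. Only the batch sizes, learning rates, LoRA rank, and epoch count come from the paper; the LoRA alpha, target module names, output paths, and all remaining defaults are assumptions added for illustration.

```python
# Hedged configuration sketch of the reported training hyperparameters.
from transformers import TrainingArguments
from peft import LoraConfig

# Discriminative tasks: fine-tuning RoBERTa on GLUE (batch size 64, lr 1e-5).
roberta_args = TrainingArguments(
    output_dir="roberta-glue",            # placeholder path
    per_device_train_batch_size=64,
    learning_rate=1e-5,
)

# Generative tasks: LoRA fine-tuning of Qwen-14B (rank 32, batch size 128, lr 2e-4, 3 epochs).
qwen_lora = LoraConfig(
    r=32,
    lora_alpha=64,                        # assumed; not stated in the paper
    target_modules=["q_proj", "v_proj"],  # placeholder; module names depend on the architecture
)
qwen_args = TrainingArguments(
    output_dir="qwen14b-lora",            # placeholder path
    per_device_train_batch_size=128,
    learning_rate=2e-4,
    num_train_epochs=3,
)
```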