TIES-Merging: Resolving Interference When Merging Models

Authors: Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, Mohit Bansal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that TIES-MERGING outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, and highlight the importance of resolving sign interference.
Researcher Affiliation | Collaboration | University of North Carolina at Chapel Hill; IBM Research; MIT
Pseudocode | Yes | Algorithm 1: TIES-MERGING Procedure (an illustrative sketch of the procedure follows the table).
Open Source Code | Yes | Our code is available at https://github.com/prateeky2806/ties-merging
Open Datasets | Yes | Specifically, we focus on (IA)3 [43] [...] and finetune (IA)3 models on the train split of eleven datasets including sentence completion (COPA [61], H-SWAG [88], and Story Cloze [68] datasets), natural language inference (ANLI [49], CB [44], and RTE [11]), coreference resolution (WSC [37] and Winogrande [64]), and word sense disambiguation (WiC [53]).
Dataset Splits | Yes | We demonstrate the effectiveness of our proposed TIES-MERGING method in various setups with: [...] (5) in the presence or absence of a validation set for setting merging hyperparameters. [...] In Table 6, we start with TIES-MERGING and remove one component at a time and report the performance on the validation set for full model merging (T5-base) and merging PEFT models ((IA)3 on T0-3B).
Hardware Specification | Yes | We executed all our experiments on Nvidia A6000 GPUs equipped with 48GB RAM.
Software Dependencies | No | The paper mentions models like T5 and CLIP but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | In our research, we utilized two variants of the T5 model, specifically the T5-base and T5-large models, which were trained to a maximum of 75,000 steps. An effective training batch size of 1024 was implemented, alongside a learning rate (lr) of 0.0001. We instituted an early stopping mechanism with a patience threshold of 5 to prevent overfitting. During the training process, bfloat16 was adopted to curtail GPU memory expenditure, and the maximum sequence length was set at 128. In contrast, for the PEFT configuration of the (IA)3 approach on the T0-3B model, we modified our parameters. An effective training batch size of 16 was deployed along with an evaluation batch size of 32, while maintaining the learning rate at 0.0001. To accommodate the model's complexity, the early stopping patience was augmented to 10. (The quoted hyperparameters are also collected in a configuration sketch below.)
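Since Algorithm 1 (the TIES-MERGING Procedure) is only referenced above, the following is a minimal PyTorch-style sketch of its three steps — trim, elect sign, and disjoint merge — applied to flattened parameter vectors. The function name `ties_merge` and the arguments `k` and `lam` are illustrative assumptions, not the authors' implementation; the released code at the repository above is the authoritative version.

```python
import torch

def ties_merge(pretrained: torch.Tensor, finetuned: list[torch.Tensor],
               k: float = 0.2, lam: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of TIES-Merging on flattened parameter vectors.

    pretrained: flattened initialization parameters
    finetuned:  list of flattened fine-tuned checkpoints (one per task)
    k:          fraction of largest-magnitude entries kept per task vector (trim)
    lam:        scaling applied to the merged task vector
    """
    # Task vectors: difference between each fine-tuned model and the initialization.
    task_vectors = torch.stack([ft - pretrained for ft in finetuned])

    # 1) Trim: keep only the top-k fraction of entries by magnitude in each task vector.
    n_keep = max(1, int(k * task_vectors.shape[1]))
    trimmed = torch.zeros_like(task_vectors)
    for i, tv in enumerate(task_vectors):
        idx = tv.abs().topk(n_keep).indices
        trimmed[i, idx] = tv[idx]

    # 2) Elect sign: per parameter, choose the sign with the larger total mass.
    elected_sign = torch.sign(trimmed.sum(dim=0))

    # 3) Disjoint merge: average only entries whose sign agrees with the elected sign.
    agree = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
    summed = (trimmed * agree).sum(dim=0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged_tv = summed / counts

    # Final merged model: initialization plus the scaled merged task vector.
    return pretrained + lam * merged_tv
```

In the paper, roughly the top 20% of entries are kept in the trim step, and the scaling of the merged task vector is either left at 1 or tuned on a validation set when one is available.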
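For reference, the fine-tuning hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. This is only a hedged restatement of the reported values; the dictionary keys are illustrative and do not come from the authors' training scripts.

```python
# Hypothetical configuration mirroring the reported fine-tuning setup;
# key names are illustrative and not taken from the authors' code.
T5_FULL_FINETUNING = {
    "models": ["t5-base", "t5-large"],
    "max_steps": 75_000,
    "effective_train_batch_size": 1024,
    "learning_rate": 1e-4,
    "early_stopping_patience": 5,
    "precision": "bfloat16",        # reduces GPU memory usage
    "max_sequence_length": 128,
}

IA3_PEFT_ON_T0_3B = {
    "base_model": "T0-3B",
    "peft_method": "(IA)3",
    "effective_train_batch_size": 16,
    "eval_batch_size": 32,
    "learning_rate": 1e-4,
    "early_stopping_patience": 10,
}
```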