TIES-Merging: Resolving Interference When Merging Models
Authors: Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, Mohit Bansal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that TIES-MERGING outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, and highlight the importance of resolving sign interference. |
| Researcher Affiliation | Collaboration | University of North Carolina at Chapel Hill; IBM Research; MIT |
| Pseudocode | Yes | Algorithm 1 TIES-MERGING Procedure. |
| Open Source Code | Yes | Our code is available at https://github.com/prateeky2806/ties-merging |
| Open Datasets | Yes | Specifically, we focus on (IA)3 [43] [...] and finetune (IA)3 models on the train split of eleven datasets including sentence completion (COPA [61], H-SWAG [88], and Story Cloze [68] datasets), natural language inference (ANLI [49], CB [44], and RTE [11]), coreference resolution (WSC [37] and Winogrande [64]), and word sense disambiguation (WiC [53]). |
| Dataset Splits | Yes | We demonstrate the effectiveness of our proposed TIES-MERGING method in various setups with: [...] (5) in the presence or absence of a validation set for setting merging hyperparameters. [...] In Table 6, we start with TIES-MERGING and remove one component at a time and report the performance on the validation set for full model merging (T5-base) and merging PEFT models ((IA)3 on T0-3B). |
| Hardware Specification | Yes | We executed all our experiments on Nvidia A6000 GPUs equipped with 48GB RAM. |
| Software Dependencies | No | The paper mentions models like T5 and CLIP but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | In our research, we utilized two variants of the T5 model, specifically the T5-base and T5-large models, which were trained to a maximum of 75,000 steps. An effective training batch size of 1024 was implemented, alongside a learning rate (lr) of 0.0001. We instituted an early stopping mechanism with a patience threshold of 5 to prevent overfitting. During the training process, bfloat16 was adopted to curtail GPU memory expenditure, and the maximum sequence length was set at 128. In contrast, for the PEFT configuration of the (IA)3 approach on the T0-3B model, we modified our parameters. An effective training batch size of 16 was deployed along with an evaluation batch size of 32, while maintaining the learning rate at 0.0001. To accommodate the model's complexity, the early stopping patience was augmented to 10. |
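
The Pseudocode row above points to Algorithm 1 (the TIES-MERGING procedure): trim each task vector to its largest-magnitude entries, elect an aggregate sign per parameter, and average only the values that agree with that sign. The sketch below is a minimal illustration of those three steps on flattened parameter vectors; it assumes PyTorch and uses illustrative function names rather than the authors' released code (see the Open Source Code row for the official repository).

```python
# Minimal sketch of TIES-MERGING on flattened parameter vectors.
# Names and defaults here are illustrative, not the authors' implementation.
import torch

def ties_merge(base: torch.Tensor, finetuned: list[torch.Tensor],
               density: float = 0.2, lam: float = 1.0) -> torch.Tensor:
    """Merge fine-tuned checkpoints into one model via trim / elect sign / disjoint merge."""
    # Task vectors: difference between each fine-tuned model and the shared base.
    task_vectors = torch.stack([ft - base for ft in finetuned])  # shape (T, P)

    # 1) Trim: keep only the top-`density` fraction of entries (by magnitude)
    #    in each task vector and reset the rest to zero.
    num_params = task_vectors.shape[1]
    k = max(1, int(density * num_params))
    thresholds = task_vectors.abs().kthvalue(num_params - k + 1, dim=1, keepdim=True).values
    trimmed = torch.where(task_vectors.abs() >= thresholds,
                          task_vectors, torch.zeros_like(task_vectors))

    # 2) Elect sign: per parameter, keep the sign with the larger total magnitude
    #    (equivalently, the sign of the summed trimmed task vectors).
    elected_sign = torch.sign(trimmed.sum(dim=0))  # shape (P,)

    # 3) Disjoint merge: average only the entries whose sign matches the elected one.
    agree = torch.sign(trimmed) == elected_sign.unsqueeze(0)
    selected = trimmed * agree
    counts = agree.sum(dim=0).clamp(min=1)
    merged_task_vector = selected.sum(dim=0) / counts

    # Scale the merged task vector and add it back to the base model.
    return base + lam * merged_task_vector
```

In practice this would be applied to the flattened parameters of each checkpoint being merged; the trim density and the scaling factor λ are the merging hyperparameters, tuned on a validation set when one is available or left at defaults otherwise (cf. the Dataset Splits row).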
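
For quick reference, the hyperparameters quoted in the Experiment Setup row can be restated as plain configuration dictionaries. The key names below are illustrative assumptions; the paper does not specify the training framework or its configuration format.

```python
# Hypothetical restatement of the reported fine-tuning hyperparameters.
# Key names are assumptions; only the values come from the paper's description.
T5_FULL_FINETUNING = {
    "max_steps": 75_000,
    "effective_train_batch_size": 1024,
    "learning_rate": 1e-4,
    "early_stopping_patience": 5,
    "precision": "bfloat16",
    "max_sequence_length": 128,
}

IA3_ON_T0_3B = {
    "effective_train_batch_size": 16,
    "eval_batch_size": 32,
    "learning_rate": 1e-4,
    "early_stopping_patience": 10,
}
```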