TIES-Merging: Resolving Interference When Merging Models

Authors: Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, Mohit Bansal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that TIES-MERGING outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, and highlight the importance of resolving sign interference.
Researcher Affiliation | Collaboration | University of North Carolina at Chapel Hill; IBM Research; MIT
Pseudocode | Yes | Algorithm 1: TIES-MERGING Procedure (an illustrative sketch of the procedure follows the table).
Open Source Code | Yes | Our code is available at https://github.com/prateeky2806/ties-merging
Open Datasets | Yes | Specifically, we focus on (IA)3 [43] [...] and finetune (IA)3 models on the train split of eleven datasets including sentence completion (COPA [61], H-SWAG [88], and Story Cloze [68] datasets), natural language inference (ANLI [49], CB [44], and RTE [11]), coreference resolution (WSC [37] and Winogrande [64]), and word sense disambiguation (WiC [53]).
Dataset Splits | Yes | We demonstrate the effectiveness of our proposed TIES-MERGING method in various setups with: [...] (5) in the presence or absence of a validation set for setting merging hyperparameters. [...] In Table 6, we start with TIES-MERGING and remove one component at a time and report the performance on the validation set for full model merging (T5-base) and merging PEFT models ((IA)3 on T0-3B).
Hardware Specification | Yes | We executed all our experiments on Nvidia A6000 GPUs equipped with 48GB RAM.
Software Dependencies | No | The paper mentions models like T5 and CLIP but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | In our research, we utilized two variants of the T5 model, specifically the T5-base and T5-large models, which were trained to a maximum of 75,000 steps. An effective training batch size of 1024 was implemented, alongside a learning rate (lr) of 0.0001. We instituted an early stopping mechanism with a patience threshold of 5 to prevent overfitting. During the training process, bfloat16 was adopted to curtail GPU memory expenditure, and the maximum sequence length was set at 128. In contrast, for the PEFT configuration of the (IA)3 approach on the T0-3B model, we modified our parameters. An effective training batch size of 16 was deployed along with an evaluation batch size of 32, while maintaining the learning rate at 0.0001. To accommodate the model's complexity, the early stopping patience was augmented to 10. (The quoted hyperparameters are also collected in a configuration sketch below.)
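Since Algorithm 1 (the TIES-MERGING Procedure) is only referenced above, the following is a minimal PyTorch-style sketch of its three steps — trim, elect sign, and disjoint merge — applied to flattened parameter vectors. The function name `ties_merge` and the arguments `k` and `lam` are illustrative assumptions, not the authors' implementation; the released code at the repository above is the authoritative version.

```python
import torch

def ties_merge(pretrained: torch.Tensor, finetuned: list[torch.Tensor],
               k: float = 0.2, lam: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of TIES-Merging on flattened parameter vectors.

    pretrained: flattened initialization parameters
    finetuned:  list of flattened fine-tuned checkpoints (one per task)
    k:          fraction of largest-magnitude entries kept per task vector (trim)
    lam:        scaling applied to the merged task vector
    """
    # Task vectors: difference between each fine-tuned model and the initialization.
    task_vectors = torch.stack([ft - pretrained for ft in finetuned])

    # 1) Trim: keep only the top-k fraction of entries by magnitude in each task vector.
    n_keep = max(1, int(k * task_vectors.shape[1]))
    trimmed = torch.zeros_like(task_vectors)
    for i, tv in enumerate(task_vectors):
        idx = tv.abs().topk(n_keep).indices
        trimmed[i, idx] = tv[idx]

    # 2) Elect sign: per parameter, choose the sign with the larger total mass.
    elected_sign = torch.sign(trimmed.sum(dim=0))

    # 3) Disjoint merge: average only entries whose sign agrees with the elected sign.
    agree = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
    summed = (trimmed * agree).sum(dim=0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged_tv = summed / counts

    # Final merged model: initialization plus the scaled merged task vector.
    return pretrained + lam * merged_tv
```

In the paper, roughly the top 20% of entries are kept in the trim step, and the scaling of the merged task vector is either left at 1 or tuned on a validation set when one is available.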
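For reference, the fine-tuning hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. This is only a hedged restatement of the reported values; the dictionary keys are illustrative and do not come from the authors' training scripts.

```python
# Hypothetical configuration mirroring the reported fine-tuning setup;
# key names are illustrative and not taken from the authors' code.
T5_FULL_FINETUNING = {
    "models": ["t5-base", "t5-large"],
    "max_steps": 75_000,
    "effective_train_batch_size": 1024,
    "learning_rate": 1e-4,
    "early_stopping_patience": 5,
    "precision": "bfloat16",        # reduces GPU memory usage
    "max_sequence_length": 128,
}

IA3_PEFT_ON_T0_3B = {
    "base_model": "T0-3B",
    "peft_method": "(IA)3",
    "effective_train_batch_size": 16,
    "eval_batch_size": 32,
    "learning_rate": 1e-4,
    "early_stopping_patience": 10,
}
```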