Merging Models with Fisher-Weighted Averaging

Authors: Michael S. Matena, Colin A. Raffel

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that merging models with Fisher merging outperforms isotropic merging in a variety of settings. We first focus on the existing applications of model ensembling [49, 67] and improving fine-tuned model robustness [66]. Then, we demonstrate for the first time that merging is a viable alternative to traditional gradient-based transfer learning. Specifically, we compare merging to intermediate-task transfer learning [47, 51] and domain-adaptive pre-training [19], finding that merging can achieve comparable performance at significantly lower cost. Additionally, we show that merging can provide an additional boost to models created via traditional intermediate-task training. This provides a concrete example of transfer that is fast and easy with merging but onerous or impossible to do with existing methods. Diagrams of the merging patterns we consider in this work are shown in fig. 1. |
| Researcher Affiliation | Academia | Michael Matena, Colin Raffel; Department of Computer Science, University of North Carolina at Chapel Hill; {mmatena,craffel}@cs.unc.edu |
| Pseudocode | No | The paper provides mathematical equations for its methods (e.g., equation 4 for Fisher merging), but it does not include any structured pseudocode or algorithm blocks. (A hedged sketch of the equation-4 merging rule appears after this table.) |
| Open Source Code | Yes | We release our code to facilitate future research into methods for merging models. (footnote 1: https://github.com/mmatena/model_merging) |
| Open Datasets | Yes | Specifically, we consider the BERT-Base model [13] fine-tuned on the RTE [8], MRPC [14], and SST-2 [59] datasets. For each dataset, we use five fine-tuned checkpoints downloaded from the Hugging Face model hub (footnote 2). ... The GLUE benchmark consists of the sentence acceptability task CoLA [64], the sentiment detection task SST-2 [59], the paraphrase detection tasks MRPC and QQP [14, 23], the sentence similarity task STS-B [7], and the natural language inference (NLI) tasks MNLI, QNLI, RTE, and WNLI [6, 54, 8, 31]. |
| Dataset Splits | Yes | We report validation set scores for Fisher merging, isotropic merging, and prediction ensembling (specifically, averaging the output probabilities of all models). (Section 3.1) and We chose λi by a grid search with 50 points, using the score on the first 2048 validation examples as the selection metric. (Section 3.3) |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU or CPU models. It only reports the FLOPs required for fine-tuning and merging, without detailing the hardware on which those operations would run. |
| Software Dependencies | No | The paper mentions software components such as the Hugging Face model hub [65] and the Adam optimizer [27], but it does not provide version numbers for any software dependencies (e.g., PyTorch, TensorFlow, or Python). |
| Experiment Setup | Yes | We use the codebase and experimental setup of Wortsman et al. [66] exactly, simply replacing isotropic merging with Fisher merging ... we apply WiSE-FT to the ImageNet [11, 58] pre-trained ViT-B/16 model [15] on five out-of-domain (OOD) datasets ... varying λ1 (the averaging weight for the pre-trained model, called α by Wortsman et al. [66]) from 0 to 1 in 0.1-step increments ... We computed a diagonal Fisher approximation for each checkpoint using up to 4096 examples from the corresponding train set. Since it is not clear a priori what weighting coefficients λi to use in this setting, we chose λi by a grid search with 50 points, using the score on the first 2048 validation examples as the selection metric. (Hedged sketches of the diagonal Fisher estimate and the λ grid search appear after this table.) |
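
Although the paper gives equation 4 only in mathematical form, the Fisher-merging rule is a per-parameter weighted average and is easy to sketch. Below is a minimal NumPy sketch, not the authors' released implementation (see the linked repository for that); the function name `fisher_merge` and the `eps` guard against all-zero Fisher entries are our own additions.

```python
import numpy as np

def fisher_merge(params, fishers, lams, eps=1e-8):
    """Fisher-weighted average of model parameters (paper's equation 4).

    params:  list of M arrays, one (flattened) parameter vector per model
    fishers: list of M arrays, diagonal Fisher estimates, same shapes as params
    lams:    list of M scalars, the per-model weights lambda_i
    eps:     guard against division by zero where every Fisher entry is zero
    """
    num = sum(lam * f * p for lam, f, p in zip(lams, fishers, params))
    den = sum(lam * f for lam, f in zip(lams, fishers))
    return num / (den + eps)

# Isotropic merging is the special case with all-ones Fishers:
# fisher_merge(params, [np.ones_like(p) for p in params], lams)
```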
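The diagonal Fisher approximation quoted in the Experiment Setup row can likewise be sketched. The PyTorch code below accumulates the squared gradient of log p(y | x) per example, drawing one label per example from the model's own predictive distribution; this single-sample Monte Carlo estimate is a simplifying assumption (for classification the paper computes the expectation over labels exactly), as are the function name and data-loader interface.

```python
import torch

def diagonal_fisher(model, data_loader, num_examples=4096, device="cpu"):
    """Estimate the diagonal Fisher of a classifier.

    Accumulates squared gradients of log p(y | x) per example, with y sampled
    from the model's own predictive distribution (one sample per example).
    """
    model.to(device).eval()
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters() if p.requires_grad}
    seen = 0
    for batch, _ in data_loader:
        for x in batch.to(device):
            model.zero_grad()
            logits = model(x.unsqueeze(0))                # shape (1, num_classes)
            probs = torch.softmax(logits.detach(), dim=-1)[0]
            y = torch.multinomial(probs, num_samples=1).item()
            logp = torch.log_softmax(logits, dim=-1)[0, y]
            logp.backward()
            for name, p in model.named_parameters():
                if p.grad is not None:
                    fisher[name] += p.grad.detach() ** 2
            seen += 1
            if seen >= num_examples:
                return {n: f / seen for n, f in fisher.items()}
    return {n: f / max(seen, 1) for n, f in fisher.items()}
```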
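Finally, the 50-point grid search over λ can be sketched as below, assuming the two-model case where λ2 = 1 − λ1. The `score_fn` callable (merge the two models with a given λ1, then score the first 2048 validation examples) is hypothetical and stands in for whatever evaluation harness the released codebase provides.

```python
import numpy as np

def grid_search_lambda(score_fn, num_points=50):
    """50-point grid search over the merging coefficient lambda_1 in [0, 1].

    score_fn(lam): hypothetical callable that merges the two models with
    weights (lam, 1 - lam) and returns the validation score used as the
    selection metric.
    """
    grid = np.linspace(0.0, 1.0, num_points)
    scores = [score_fn(lam) for lam in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]
```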