Merging Models with Fisher-Weighted Averaging

Authors: Michael S. Matena, Colin A. Raffel

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that merging models with Fisher merging outperforms isotropic merging in a variety of settings. We first focus on the existing applications of model ensembling [49, 67] and improving fine-tuned model robustness [66]. Then, we demonstrate for the first time that merging is a viable alternative to traditional gradient-based transfer learning. Specifically, we compare merging to intermediate-task transfer learning [47, 51] and domain-adaptive pre-training [19], finding that merging can achieve comparable performance at significantly lower cost. Additionally, we show that merging can provide an additional boost to models created via traditional intermediate-task training. This provides a concrete example of transfer that is fast and easy with merging but onerous or impossible to do with existing methods. Diagrams of the merging patterns we consider in this work are shown in fig. 1. |
| Researcher Affiliation | Academia | Michael Matena, Colin Raffel; Department of Computer Science, University of North Carolina at Chapel Hill; {mmatena,craffel}@cs.unc.edu |
| Pseudocode | No | The paper provides mathematical equations for its methods (e.g., equation 4 for Fisher merging), but it does not include any structured pseudocode or algorithm blocks. (A hedged sketch of the equation-4 merging rule appears after this table.) |
| Open Source Code | Yes | We release our code to facilitate future research into methods for merging models. (footnote 1: https://github.com/mmatena/model_merging) |
| Open Datasets | Yes | Specifically, we consider the BERT-Base model [13] fine-tuned on the RTE [8], MRPC [14], and SST-2 [59] datasets. For each dataset, we use five fine-tuned checkpoints downloaded from the Hugging Face model hub (footnote 2). ... The GLUE benchmark consists of the sentence acceptability task CoLA [64], the sentiment detection task SST-2 [59], the paraphrase detection tasks MRPC and QQP [14, 23], the sentence similarity task STS-B [7], and the natural language inference (NLI) tasks MNLI, QNLI, RTE, and WNLI [6, 54, 8, 31]. |
| Dataset Splits | Yes | We report validation set scores for Fisher merging, isotropic merging, and prediction ensembling (specifically, averaging the output probabilities of all models). (Section 3.1) and We chose λi by a grid search with 50 points, using the score on the first 2048 validation examples as the selection metric. (Section 3.3) |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU or CPU models. It only reports the FLOPs required for fine-tuning and merging, without detailing the hardware on which those operations would run. |
| Software Dependencies | No | The paper mentions software components such as the Hugging Face model hub [65] and the Adam optimizer [27], but it does not provide version numbers for any software dependencies (e.g., PyTorch, TensorFlow, or Python). |
| Experiment Setup | Yes | We use the codebase and experimental setup of Wortsman et al. [66] exactly, simply replacing isotropic merging with Fisher merging ... we apply WiSE-FT to the ImageNet [11, 58] pre-trained ViT-B/16 model [15] on five out-of-domain (OOD) datasets ... varying λ1 (the averaging weight for the pre-trained model, called α by Wortsman et al. [66]) from 0 to 1 in 0.1-step increments ... We computed a diagonal Fisher approximation for each checkpoint using up to 4096 examples from the corresponding train set. Since it is not clear a priori what weighting coefficients λi to use in this setting, we chose λi by a grid search with 50 points, using the score on the first 2048 validation examples as the selection metric. (Hedged sketches of the diagonal Fisher estimate and the λ grid search appear after this table.) |
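
Although the paper gives equation 4 only in mathematical form, the Fisher-merging rule is a per-parameter weighted average and is easy to sketch. Below is a minimal NumPy sketch, not the authors' released implementation (see the linked repository for that); the function name `fisher_merge` and the `eps` guard against all-zero Fisher entries are our own additions.

```python
import numpy as np

def fisher_merge(params, fishers, lams, eps=1e-8):
    """Fisher-weighted average of model parameters (paper's equation 4).

    params:  list of M arrays, one (flattened) parameter vector per model
    fishers: list of M arrays, diagonal Fisher estimates, same shapes as params
    lams:    list of M scalars, the per-model weights lambda_i
    eps:     guard against division by zero where every Fisher entry is zero
    """
    num = sum(lam * f * p for lam, f, p in zip(lams, fishers, params))
    den = sum(lam * f for lam, f in zip(lams, fishers))
    return num / (den + eps)

# Isotropic merging is the special case with all-ones Fishers:
# fisher_merge(params, [np.ones_like(p) for p in params], lams)
```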
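The diagonal Fisher approximation quoted in the Experiment Setup row can likewise be sketched. The PyTorch code below accumulates the squared gradient of log p(y | x) per example, drawing one label per example from the model's own predictive distribution; this single-sample Monte Carlo estimate is a simplifying assumption (for classification the paper computes the expectation over labels exactly), as are the function name and data-loader interface.

```python
import torch

def diagonal_fisher(model, data_loader, num_examples=4096, device="cpu"):
    """Estimate the diagonal Fisher of a classifier.

    Accumulates squared gradients of log p(y | x) per example, with y sampled
    from the model's own predictive distribution (one sample per example).
    """
    model.to(device).eval()
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters() if p.requires_grad}
    seen = 0
    for batch, _ in data_loader:
        for x in batch.to(device):
            model.zero_grad()
            logits = model(x.unsqueeze(0))                # shape (1, num_classes)
            probs = torch.softmax(logits.detach(), dim=-1)[0]
            y = torch.multinomial(probs, num_samples=1).item()
            logp = torch.log_softmax(logits, dim=-1)[0, y]
            logp.backward()
            for name, p in model.named_parameters():
                if p.grad is not None:
                    fisher[name] += p.grad.detach() ** 2
            seen += 1
            if seen >= num_examples:
                return {n: f / seen for n, f in fisher.items()}
    return {n: f / max(seen, 1) for n, f in fisher.items()}
```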
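Finally, the 50-point grid search over λ can be sketched as below, assuming the two-model case where λ2 = 1 − λ1. The `score_fn` callable (merge the two models with a given λ1, then score the first 2048 validation examples) is hypothetical and stands in for whatever evaluation harness the released codebase provides.

```python
import numpy as np

def grid_search_lambda(score_fn, num_points=50):
    """50-point grid search over the merging coefficient lambda_1 in [0, 1].

    score_fn(lam): hypothetical callable that merges the two models with
    weights (lam, 1 - lam) and returns the validation score used as the
    selection metric.
    """
    grid = np.linspace(0.0, 1.0, num_points)
    scores = [score_fn(lam) for lam in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]
```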