Dataless Knowledge Fusion by Merging Weights of Language Models

Authors: Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, Pengxiang Cheng

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. The experimental results across multiple model types (e.g. RoBERTa, T5, DeBERTa) show that our proposed method consistently and significantly outperforms other model merging and ensembling baselines and achieves higher generalization performance than the best individual models on out-of-domain data sets across several data collections.
Researcher Affiliation | Collaboration | Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, Pengxiang Cheng; University of Southern California; Bloomberg; {xisenjin, xiangren}@usc.edu; {dpreotiucpie, pcheng134}@bloomberg.net
Pseudocode | Yes | Algorithm 1: RegMean for Transformer Language Models (a hedged code sketch of the per-layer RegMean update is given below the table).
Open Source Code | Yes | The code is available at: https://github.com/bloomberg/dataless-model-merging
Open Datasets | Yes | We use the GLUE datasets (Wang et al., 2018) for studying merging models trained for non-i.i.d. partitions and merging models trained for different tasks. For emotion classification, we use the collection of preprocessed datasets from (Oberländer & Klinger, 2018). For NER tasks, we use 6 domains in OntoNotes (Hovy et al., 2006) for training individual models, and use CoNLL (Sang & De Meulder, 2003) and Twitter NER (Rijhwani & Preotiuc-Pietro, 2020) to measure out-of-domain generalization performance.
Dataset Splits | Yes | For each task, we split the training data into two partitions of 1,000 training examples each, with different label distributions (details in Appendix B). The merged models are evaluated on the official validation sets (i.e. with a joint distribution of both partitions). Appendix B includes tables (e.g., Table 4: 'Statistics of emotion classification datasets.') that explicitly list 'Train Dev Test' splits with corresponding numerical sizes, such as 'DailyDialog 72,085 10,298 20,596'.
Hardware Specification | No | The paper does not specify the exact hardware used for its experiments, such as GPU models (e.g., NVIDIA A100, RTX series), CPU models (e.g., Intel Xeon, AMD Ryzen), or specific cloud computing instance types with their configurations. It discusses pre-trained models and training details but omits hardware specifications.
Software Dependencies | No | The paper mentions key software components like 'huggingface’s transformer library (Wolf et al., 2019)' and 'PyTorch (Paszke et al., 2019)'. However, it does not provide specific version numbers for these or any other ancillary software dependencies, which would be necessary for full reproducibility.
Experiment Setup | Yes | We fine-tune DistilBERT-base, RoBERTa-base, and DeBERTa-large with an initial learning rate of 1e-5, and fine-tune T5-base with an initial learning rate of 1e-4. We use the AdamW optimizer throughout the experiments. The learning rate gradually warms up over the first 6% of training steps and then decays linearly to 0. We train models with a batch size of 16, for 10 epochs on GLUE, 30 epochs on emotion classification, and 20 epochs on NER. We set the non-diagonal multiplier α in RegMean to 0.9, with the exception of T5-base models, where it is 0.1.
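
A minimal sketch of the per-layer RegMean update referenced by the Pseudocode row (Algorithm 1), assuming linear layers computed as x @ W and precomputed Gram matrices G_j = X_j^T X_j of each individual model's layer inputs; the function and variable names below are illustrative and not taken from the released code:

```python
import torch

def regmean_merge_linear(weights, grams, alpha=0.9):
    """Merge one linear layer's weights across models with RegMean.

    weights: list of (d_in, d_out) tensors W_j, one per individual model
             (convention: the layer computes x @ W_j)
    grams:   list of (d_in, d_in) Gram matrices G_j = X_j^T X_j of that
             layer's inputs, computed on each model's own training data
    alpha:   multiplier applied to the non-diagonal entries of each G_j
             (0.9 in the paper, 0.1 for T5-base)
    """
    summed_gram = torch.zeros_like(grams[0])
    summed_gram_w = torch.zeros_like(weights[0])
    for w, g in zip(weights, grams):
        # Scale non-diagonal entries of G_j by alpha to keep the solve well conditioned.
        g = alpha * g + (1.0 - alpha) * torch.diag(torch.diagonal(g))
        summed_gram = summed_gram + g
        summed_gram_w = summed_gram_w + g @ w
    # Closed-form merge: W_M = (sum_j G_j)^{-1} (sum_j G_j W_j).
    return torch.linalg.solve(summed_gram, summed_gram_w)
```

In the paper this closed-form solve is applied independently to every linear layer of the transformer, while parameters that are not linear-layer weights (e.g., embeddings and biases) fall back to simple averaging.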
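
The Experiment Setup row translates into roughly the following Hugging Face TrainingArguments configuration. This is a sketch assuming the Trainer API (the paper only names the transformers and PyTorch libraries); the output path is illustrative, the RoBERTa-base / GLUE values are shown, and the reported alternatives are noted in comments:

```python
from transformers import TrainingArguments

# Fine-tuning setup reported in the paper for the individual models.
training_args = TrainingArguments(
    output_dir="checkpoints/roberta-base-glue",  # illustrative path
    learning_rate=1e-5,              # 1e-4 for T5-base
    per_device_train_batch_size=16,
    num_train_epochs=10,             # 30 for emotion classification, 20 for NER
    warmup_ratio=0.06,               # warm up over the first 6% of steps
    lr_scheduler_type="linear",      # then decay linearly to 0
)
# The Trainer's default optimizer is AdamW, matching the paper.

# RegMean-specific setting from the same row:
REGMEAN_ALPHA = 0.9  # non-diagonal multiplier; 0.1 for T5-base models
```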