Model Merging by Uncertainty-Based Gradient Matching

Authors: Nico Daheim, Thomas Möllenhoff, Edoardo Ponti, Iryna Gurevych, Mohammad Emtiyaz Khan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here. Empirical results on LLMs and ViTs show consistent improvements, both in terms of performance and robustness to hyperparameters.
Researcher Affiliation | Academia | 1) Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; 2) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; 3) University of Edinburgh
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Code available here. We will also provide a repository containing the implementation upon acceptance.
Open Datasets | Yes | We use a pretrained ViT for image classification and add eight datasets to it: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2018), GTSRB (Houben et al., 2013), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2010), and SVHN (Yuval, 2011), replicating the method and datasets used in Ilharco et al. (2023). We first train on IMDB (Maas et al., 2011). (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions tuning αt on a validation set but does not give specific details on the validation splits used in its experiments (e.g., exact percentages or sample counts). (A sketch of validation-based tuning of the merging coefficient follows the table.)
Hardware Specification | No | All models are trained using a large batch size of 128 on NVIDIA GPUs. We truncate the inputs at 384 tokens, and train using a batch size of 16 on NVIDIA GPUs.
Software Dependencies | No | All models are trained using AdamW (Loshchilov & Hutter, 2019) or a modified version of Adam (Kingma & Ba, 2015) with a decoupled quadratic penalty. Furthermore, we set β1 = 0.9 and β2 = 0.999 as is standard in the transformers library.
Experiment Setup | Yes | We train for 5 epochs for small datasets and 10 for larger ones. We use a learning rate of 1e-3, β1 = 0.9, and β2 = 0.999. We train each model for 2 epochs on IMDB and Yelp, 1 epoch on Amazon, and 5 epochs on the smaller SST2 and Rotten Tomatoes datasets. We use a learning rate of 1e-5 for training RoBERTa-base on IMDB and of 5e-6 for training the other models initialized from the IMDB model. We truncate the inputs at 384 tokens, and train using a batch size of 16. We set β1 = 0.9 and β2 = 0.999 as is standard in the transformers library, and use 100 warmup steps, as well as gradient norm clipping to unit norm. We use AdamW or the modified version with a learning rate of 6.25e-5, β1 = 0.9, and β2 = 0.999. (A training-configuration sketch follows the table.)
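
As a concrete illustration of the Open Datasets row, the sketch below loads a few of the publicly available datasets named above with standard torchvision and Hugging Face loaders. This is a minimal sketch, not code from the paper; the cache directory is hypothetical and some datasets (e.g., Cars, RESISC45) are omitted because their loaders differ across library versions.

    # Minimal sketch (assumed loaders, not from the paper): fetch a subset of the
    # public datasets listed in the Open Datasets row.
    from torchvision import datasets, transforms
    from datasets import load_dataset  # Hugging Face datasets

    to_tensor = transforms.ToTensor()
    root = "./data"  # hypothetical local cache directory

    mnist = datasets.MNIST(root, train=True, download=True, transform=to_tensor)
    svhn = datasets.SVHN(root, split="train", download=True, transform=to_tensor)
    gtsrb = datasets.GTSRB(root, split="train", download=True, transform=to_tensor)
    eurosat = datasets.EuroSAT(root, download=True, transform=to_tensor)
    dtd = datasets.DTD(root, split="train", download=True, transform=to_tensor)

    imdb = load_dataset("imdb")  # text dataset used for the language-model experiments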
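
The Dataset Splits row mentions tuning the merging coefficient αt on a validation set. The sketch below illustrates that practice in the style of task arithmetic (Ilharco et al., 2023) with a plain grid search; the helper names and the single shared coefficient are assumptions, and this is not the paper's uncertainty-based gradient-matching scheme.

    import copy
    import torch

    def merge(pretrained_sd, finetuned_sds, alpha):
        """Add scaled task vectors (finetuned minus pretrained) to the pretrained weights."""
        merged = copy.deepcopy(pretrained_sd)
        for key in merged:
            if not torch.is_floating_point(merged[key]):
                continue  # skip integer buffers such as step counters
            for ft_sd in finetuned_sds:
                merged[key] = merged[key] + alpha * (ft_sd[key] - pretrained_sd[key])
        return merged

    def tune_alpha(model, pretrained_sd, finetuned_sds, val_loader, evaluate):
        """Hypothetical grid search: keep the coefficient with the best validation score."""
        best_alpha, best_score = None, float("-inf")
        for alpha in [0.1 * i for i in range(1, 11)]:
            model.load_state_dict(merge(pretrained_sd, finetuned_sds, alpha))
            score = evaluate(model, val_loader)  # evaluate() is assumed to return accuracy
            if score > best_score:
                best_alpha, best_score = alpha, score
        return best_alpha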
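
The Software Dependencies and Experiment Setup rows describe a standard transformers-style fine-tuning configuration. The sketch below maps the quoted hyperparameters for RoBERTa-base on IMDB onto Hugging Face TrainingArguments as an assumption about the setup; it is not the authors' code, and their modified Adam with a decoupled quadratic penalty is not reproduced here (the Trainer default is AdamW).

    from transformers import TrainingArguments

    # Assumed mapping of the quoted IMDB fine-tuning hyperparameters onto
    # TrainingArguments; output_dir is a hypothetical path.
    args = TrainingArguments(
        output_dir="roberta-base-imdb",
        num_train_epochs=2,               # 2 epochs on IMDB
        per_device_train_batch_size=16,   # batch size of 16
        learning_rate=1e-5,               # 1e-5 for RoBERTa-base on IMDB
        adam_beta1=0.9,
        adam_beta2=0.999,
        warmup_steps=100,                 # 100 warmup steps
        max_grad_norm=1.0,                # gradient norm clipping to unit norm
    )
    # Inputs would be truncated to 384 tokens at tokenization time,
    # e.g. tokenizer(text, truncation=True, max_length=384).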