Model Merging by Uncertainty-Based Gradient Matching
Authors: Nico Daheim, Thomas Möllenhoff, Edoardo Ponti, Iryna Gurevych, Mohammad Emtiyaz Khan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Empirical results on LLMs and ViTs show consistent improvements. |
| Researcher Affiliation | Academia | ¹ Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; ² RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; ³ University of Edinburgh |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code available here. We will also provide a repository containing the implementation upon acceptance. |
| Open Datasets | Yes | We use a pretrained ViT for image classification and add eight datasets to it: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2018), GTSRB (Houben et al., 2013), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2010), and SVHN (Yuval, 2011), replicating the method and datasets used in Ilharco et al. (2023). We first train on IMDB (Maas et al., 2011). |
| Dataset Splits | No | The paper mentions tuning α_t on a validation set but does not provide specific details on the validation splits used in its experiments (e.g., exact percentages or sample counts). |
| Hardware Specification | No | All models are trained using a large batch size of 128 on NVIDIA GPUs. We truncate the inputs at 384 tokens, and train using a batch size of 16 on NVIDIA GPUs. |
| Software Dependencies | No | All models are trained using AdamW (Loshchilov & Hutter, 2019) or a modified version of Adam (Kingma & Ba, 2015) with a decoupled quadratic penalty (see the optimizer sketch after the table). Furthermore, we set β1 = 0.9 and β2 = 0.999 as is standard in the transformers library. |
| Experiment Setup | Yes | We train for 5 epochs for small datasets and 10 for larger ones. We use a learning rate of 1e-3, β1 = 0.9 and β2 = 0.999. We train each model for 2 epochs on IMDB and Yelp, 1 epoch on Amazon, and 5 epochs on the smaller SST2 and Rotten Tomatoes datasets. We use a learning rate of 1e-5 for training RoBERTa-base on IMDB and of 5e-6 for training the other models initialized from the IMDB model. We truncate the inputs at 384 tokens, and train using a batch size of 16. We set β1 = 0.9 and β2 = 0.999 as is standard in the transformers library, and use 100 warmup steps, as well as gradient norm clipping to unit norm. We use AdamW or the modified version with a learning rate of 6.25e-5, β1 = 0.9, and β2 = 0.999. (See the configuration sketch after the table.) |
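The "modified version of Adam with a decoupled quadratic penalty" quoted under Software Dependencies is not accompanied by released code. Below is a minimal sketch, assuming the penalty is applied in decoupled fashion (as in AdamW) but pulls parameters toward a reference point such as the pretrained weights rather than toward zero; the function name and signature are illustrative, not the authors' implementation.

```python
import torch

@torch.no_grad()
def decoupled_quadratic_update(param, ref, adam_direction, lr, penalty):
    """One parameter update: a standard Adam step plus a decoupled quadratic
    penalty that pulls `param` toward `ref` (assumed here to be the pretrained
    weights). With ref = 0 this reduces to AdamW's decoupled weight decay."""
    param.add_(adam_direction, alpha=-lr)          # Adam part: m_hat / (sqrt(v_hat) + eps)
    param.add_(param - ref, alpha=-lr * penalty)   # decoupled quadratic penalty toward ref
```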
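The hyperparameters listed under Experiment Setup map onto a standard Hugging Face fine-tuning configuration. The sketch below assumes the transformers Trainer API and the roberta-base checkpoint for the IMDB run; since no repository is released, this is only an illustrative mapping of the reported values (learning rate 1e-5, batch size 16, 384-token truncation, β1 = 0.9, β2 = 0.999, 100 warmup steps, gradient clipping to unit norm, 2 epochs), not the authors' actual script.

```python
from transformers import AutoTokenizer, TrainingArguments

# Reported values for fine-tuning RoBERTa-base on IMDB (illustrative mapping only).
training_args = TrainingArguments(
    output_dir="imdb-roberta-base",   # hypothetical output path
    num_train_epochs=2,               # 2 epochs on IMDB
    per_device_train_batch_size=16,   # batch size of 16
    learning_rate=1e-5,               # 1e-5 for RoBERTa-base on IMDB
    adam_beta1=0.9,
    adam_beta2=0.999,
    warmup_steps=100,                 # 100 warmup steps
    max_grad_norm=1.0,                # gradient norm clipping to unit norm
)

# Inputs are truncated at 384 tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encode = lambda batch: tokenizer(batch["text"], truncation=True, max_length=384)
```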