Localizing Task Information for Improved Model Merging and Compression
Authors: Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in vision and NLP benchmarks with up to 20 tasks show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of original performance. |
| Researcher Affiliation | Collaboration | 1 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; 2 Google DeepMind; 3 Work done while at EPFL; 4 University of Geneva, Geneva, Switzerland. |
| Pseudocode | No | The paper describes algorithms using equations and text but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The source code can be found at https://github.com/nik-dim/tall_masks. |
| Open Datasets | Yes | For the 8-task vision benchmark proposed by Ilharco et al. (2023), we randomly select a subset of weights for each task and perform gradient updates only for those parameters... |
| Dataset Splits | Yes | The results for this control experiment are presented in Table 1, compared with task arithmetic where the models are fine-tuned in a standard way. Looking at the normalized accuracy, defined in Appendix A, we observe that the performance of task arithmetic in the controlled setting deteriorates at the same rate as standard fine-tuning, where the accuracy of the merged model is 2.7% worse than the standard case... We validate the efficacy of our mask construction by checking if the original performance in the same 8-task computer vision benchmark, evaluated on a held-out dataset, can be restored... Note that λ_t is selected based on the validation accuracy of each task respectively, allowing for the task-specific problems to be solved in parallel and independently. |
| Hardware Specification | Yes | All our experiments were performed using the same hardware consisting of four V100 NVIDIA GPUs with 32GB of memory each. |
| Software Dependencies | No | The paper mentions software like the "AdamW optimizer" and "CLIP model variants" but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | Specifically, we fine-tune the same pre-trained CLIP checkpoint obtained from the openclip repository (Ilharco et al., 2021). We fine-tune for 2,000 iterations, using a batch size of 128, a learning rate of 1e-5, and a cosine annealing learning rate schedule with 200 warm-up steps, along with the AdamW optimizer... For constructing task-specific masks, we tune the hyper-parameter λ for each task over {0.2, 0.3, 0.4, 0.5, 0.6}... The scaling factor is tuned over a range of {0.0, 0.1, ..., 0.9, 1.0}, selected based on the performance on the validation set averaged on all tasks. (Hedged sketches of the mask rule and the scaling-factor search follow the table.) |
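
For the mask-construction and compression claims quoted above, a minimal PyTorch sketch follows. It assumes the TALL-mask thresholding rule m_t = 1{|τ_t| ≥ λ_t · |τ_MTL − τ_t|} applied to flattened task vectors τ_t = θ_t − θ_pre; this rule, the toy tensor shapes, and the helper name `tall_mask` are our reading of the paper, not code taken from the authors' repository (https://github.com/nik-dim/tall_masks).

```python
import torch

def tall_mask(tau_t: torch.Tensor, tau_mtl: torch.Tensor, lam: float) -> torch.Tensor:
    """Keep a weight when the task's own update dominates the residual
    contribution of the other tasks (assumed rule:
    |tau_t| >= lam * |tau_MTL - tau_t|)."""
    return tau_t.abs() >= lam * (tau_mtl - tau_t).abs()

# Toy flattened task vectors standing in for tau_t = theta_t - theta_pre.
task_vectors = [torch.randn(10_000) for _ in range(8)]
tau_mtl = torch.stack(task_vectors).sum(dim=0)  # multi-task vector

# Per the paper, lambda_t is tuned per task over {0.2, 0.3, 0.4, 0.5, 0.6}
# on that task's validation accuracy; a single value is used here for illustration.
masks = [tall_mask(tv, tau_mtl, lam=0.4) for tv in task_vectors]

# Storage-friendly reconstruction: keep only theta_pre, tau_mtl, and the
# binary masks, then rebuild each task checkpoint as
# theta_t ~= theta_pre + m_t * tau_mtl (our reading of the compression scheme).
theta_pre = torch.randn(10_000)
reconstructed = [theta_pre + mask * tau_mtl for mask in masks]
```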
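
The scaling-factor tuning in the Experiment Setup row amounts to a one-dimensional grid search over the merged model. The sketch below assumes task-arithmetic-style merging, θ = θ_pre + α Σ_t τ_t; `merge` and `mean_validation_accuracy` are hypothetical stand-ins for the authors' merging and evaluation code.

```python
import torch

def merge(pretrained: dict, task_vectors: list, alpha: float) -> dict:
    """Task-arithmetic-style merge: theta = theta_pre + alpha * sum_t tau_t."""
    return {name: weight + alpha * sum(tv[name] for tv in task_vectors)
            for name, weight in pretrained.items()}

# Toy stand-ins for the CLIP state dict and the per-task deltas.
pretrained = {"w": torch.randn(512, 512)}
task_vectors = [{"w": 0.01 * torch.randn(512, 512)} for _ in range(8)]

def mean_validation_accuracy(model: dict) -> float:
    # Hypothetical placeholder: the real loop evaluates the merged model
    # on every task's validation split and averages the accuracies.
    return -model["w"].norm().item()

# Scaling factor tuned over {0.0, 0.1, ..., 0.9, 1.0}, keeping the value
# with the best validation accuracy averaged over all tasks.
best_alpha = max((i / 10 for i in range(11)),
                 key=lambda a: mean_validation_accuracy(merge(pretrained, task_vectors, a)))
```

Note that, per the quoted text, α is selected on validation performance averaged over all tasks, whereas each λ_t is selected per task, so the per-task mask searches can run independently and in parallel.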