Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models

Authors: Guillermo Ortiz-Jimenez, Alessandro Favero, Pascal Frossard

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive study of task arithmetic in vision-language models and show that weight disentanglement is the crucial factor that makes it effective. This leads to substantial performance improvements across multiple task arithmetic benchmarks and diverse models. Building on these findings, we provide theoretical and empirical analyses of the neural tangent kernel (NTK) of these models and establish a compelling link between task arithmetic and the spatial localization of the NTK eigenfunctions. (See the task-arithmetic sketch after the table.)
Researcher Affiliation | Academia | Guillermo Ortiz-Jimenez, EPFL, Lausanne, Switzerland (guillermo.ortizjimenez@epfl.ch); Alessandro Favero, EPFL, Lausanne, Switzerland (alessandro.favero@epfl.ch); Pascal Frossard, EPFL, Lausanne, Switzerland (pascal.frossard@epfl.ch)
Pseudocode | Yes | Listing 1: Basic PyTorch code to linearize a model. (See the linearization sketch after the table.)
Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/gortizji/tangent_task_arithmetic.
Open Datasets | Yes | We fine-tune (FT) several CLIP pre-trained Vision Transformers (ViTs) [24] of different sizes following the same setup as Ilharco et al. [39] on 8 tasks: Cars [43], DTD [20], SUN397 [88], EuroSAT [33], GTSRB [80], MNIST [44], SVHN [60] and RESISC45 [15].
Dataset Splits | Yes | The tuning of α is done independently for non-linear FT, linearized FT, and post-hoc linearization. As in Ilharco et al. [39], we use a single coefficient α to tune the size of the task vectors used to modify the pre-trained models. This is equivalent to setting α = α1 = ... = αT in Eq. (1). Both in the task addition and task negation benchmarks, after fine-tuning, we evaluate different scaling coefficients α ∈ {0.0, 0.05, 0.1, ..., 1.0} and choose the value that achieves the highest target metric on a small held-out proportion of the training set, as specified in Ilharco et al. [39]. (See the α-sweep sketch after the table.)
Hardware Specification | Yes | All our experiments were performed using the same hardware consisting of four V100 NVIDIA GPUs with 32GB of memory each and can be reproduced in less than 350 GPU hours.
Software Dependencies | No | The paper mentions using the functorch sublibrary of PyTorch and the AdamW optimizer but does not specify their version numbers or the version of Python used.
Experiment Setup | Yes | In particular, we fine-tune all datasets starting from the same CLIP pre-trained checkpoint downloaded from the open_clip repository [37]. We fine-tune for 2,000 iterations with a batch size of 128, a learning rate of 10⁻⁵, and a cosine annealing learning rate schedule with 200 warm-up steps and the AdamW optimizer [49]. (See the fine-tuning sketch after the table.)
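
For context on the rows above: task arithmetic builds a task vector for each task as the difference between fine-tuned and pre-trained weights, and edits the model by adding a scaled sum of these vectors to the pre-trained checkpoint. Below is a minimal PyTorch-style sketch of this operation, not the authors' implementation; the state dicts pretrained_sd and finetuned_sds are hypothetical.

    def build_task_vector(pretrained_sd, finetuned_sd):
        # Task vector: fine-tuned weights minus pre-trained weights.
        return {k: finetuned_sd[k] - pretrained_sd[k] for k in pretrained_sd}

    def apply_task_arithmetic(pretrained_sd, task_vectors, alpha):
        # Edited weights: theta_pre + alpha * sum_t tau_t, with a single alpha
        # shared across tasks, as described in the Dataset Splits row.
        edited = {k: v.clone() for k, v in pretrained_sd.items()}
        for tau in task_vectors:
            for k in edited:
                edited[k] = edited[k] + alpha * tau[k]
        return edited

    # Usage (hypothetical CLIP ViT state_dicts; best_alpha would come from a
    # sweep like the one sketched further below):
    # task_vectors = [build_task_vector(pretrained_sd, sd) for sd in finetuned_sds]
    # model.load_state_dict(apply_task_arithmetic(pretrained_sd, task_vectors, best_alpha))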
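
The Pseudocode row refers to the paper's Listing 1, which linearizes a model using the functorch sublibrary. The sketch below is an illustrative reconstruction under that assumption, not the paper's exact listing: it evaluates the first-order Taylor expansion f_lin(x; θ) = f(x; θ0) + (θ − θ0)ᵀ ∇θ f(x; θ0) with a Jacobian-vector product.

    import torch.nn as nn
    from functorch import jvp, make_functional_with_buffers

    class LinearizedModel(nn.Module):
        # Linearizes `model` around its initial (pre-trained) parameters theta0.
        def __init__(self, model):
            super().__init__()
            func, params0, buffers = make_functional_with_buffers(model)
            self.func = func
            self.buffers = buffers
            # Frozen linearization point theta0.
            self.params0 = tuple(p.detach().clone() for p in params0)
            # Trainable parameters theta, initialized at theta0.
            self.params = nn.ParameterList(nn.Parameter(p.detach().clone()) for p in params0)

        def forward(self, x):
            # Parameter displacement theta - theta0.
            dparams = tuple(p - p0 for p, p0 in zip(self.params, self.params0))
            # jvp returns f(x; theta0) and its Jacobian-vector product with dparams.
            out, jvp_out = jvp(
                lambda *params: self.func(params, self.buffers, x),
                self.params0,
                dparams,
            )
            return out + jvp_out

Training this wrapper instead of the original model corresponds to the "linearized FT" mentioned in the Dataset Splits row: gradients flow only into self.params, while the linearization point theta0 stays fixed.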
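
The Dataset Splits row describes how the single scaling coefficient α is chosen. A minimal sketch of that sweep, assuming the apply_task_arithmetic helper from the first sketch and a hypothetical evaluate(state_dict, split) function that returns the target metric on the held-out portion of the training set:

    import numpy as np

    def tune_alpha(pretrained_sd, task_vectors, heldout_split, evaluate):
        # Evaluate alpha in {0.0, 0.05, ..., 1.0} and keep the best value.
        best_alpha, best_metric = 0.0, float("-inf")
        for alpha in np.arange(0.0, 1.0 + 1e-9, 0.05):
            edited_sd = apply_task_arithmetic(pretrained_sd, task_vectors, alpha=float(alpha))
            metric = evaluate(edited_sd, heldout_split)
            if metric > best_metric:
                best_alpha, best_metric = float(alpha), metric
        return best_alpha, best_metric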
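
Finally, the Experiment Setup row fixes the optimizer and schedule. The loop below is a sketch of such a configuration (AdamW at a learning rate of 10⁻⁵, 2,000 iterations, batch size 128, 200 warm-up steps, cosine annealing); the data loader, loss function, and device handling are hypothetical, and the weight decay is left at PyTorch's default because the quoted setup does not specify it.

    import math
    import torch

    def cosine_with_warmup(step, warmup_steps=200, total_steps=2000):
        # Linear warm-up for the first 200 steps, then cosine annealing to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    def finetune(model, loader, loss_fn, total_steps=2000, device="cuda"):
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
        model.to(device).train()
        data_iter = iter(loader)  # loader is assumed to use batch_size=128
        for _ in range(total_steps):
            try:
                images, labels = next(data_iter)
            except StopIteration:
                data_iter = iter(loader)  # restart the loader when exhausted
                images, labels = next(data_iter)
            loss = loss_fn(model(images.to(device)), labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
        return model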