Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
Authors: Guillermo Ortiz-Jimenez, Alessandro Favero, Pascal Frossard
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive study of task arithmetic in vision-language models and show that weight disentanglement is the crucial factor that makes it effective. This leads to substantial performance improvements across multiple task arithmetic benchmarks and diverse models. Building on these findings, we provide theoretical and empirical analyses of the neural tangent kernel (NTK) of these models and establish a compelling link between task arithmetic and the spatial localization of the NTK eigenfunctions. |
| Researcher Affiliation | Academia | Guillermo Ortiz-Jimenez EPFL, Lausanne, Switzerland guillermo.ortizjimenez@epfl.ch Alessandro Favero EPFL, Lausanne, Switzerland alessandro.favero@epfl.ch Pascal Frossard EPFL, Lausanne, Switzerland pascal.frossard@epfl.ch |
| Pseudocode | Yes | Listing 1: Basic PyTorch code to linearize a model. |
| Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/gortizji/tangent_task_arithmetic. |
| Open Datasets | Yes | We fine-tune (FT) several CLIP pre-trained Vision Transformers (ViTs) [24] of different sizes following the same setup as Ilharco et al. [39] on 8 tasks: Cars [43], DTD [20], SUN397 [88], EuroSAT [33], GTSRB [80], MNIST [44], SVHN [60] and RESISC45 [15]. |
| Dataset Splits | Yes | The tuning of α is done independently for non-linear FT, linearized FT, and post-hoc linearization. As in Ilharco et al. [39], we use a single coefficient α to tune the size of the task vectors used to modify the pre-trained models. This is equivalent to setting α = α1 = · · · = αT in Eq. (1). In both the task addition and task negation benchmarks, after fine-tuning, we evaluate different scaling coefficients α ∈ {0.0, 0.05, 0.1, . . . , 1.0} and choose the value that achieves the highest target metric on a small held-out proportion of the training set, as specified in Ilharco et al. [39]. |
| Hardware Specification | Yes | All our experiments were performed using the same hardware consisting of four V100 NVIDIA GPUs with 32GB of memory each and can be reproduced in less than 350 GPU hours. |
| Software Dependencies | No | The paper mentions using the 'functorch sublibrary of PyTorch' and the 'AdamW optimizer' but does not specify their version numbers or the version of Python used. |
| Experiment Setup | Yes | In particular, we fine-tune all datasets starting from the same CLIP pre-trained checkpoint downloaded from the open_clip repository [37]. We fine-tune for 2,000 iterations with a batch size of 128, a learning rate of 10⁻⁵, a cosine annealing learning rate schedule with 200 warm-up steps, and the AdamW optimizer [49]. |
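The linearization referenced in Listing 1 is a first-order Taylor expansion of the network around its pre-trained weights. The paper's listing uses functorch/PyTorch; as a dependency-free sketch of the same idea, here is a toy one-parameter model (the function `f` and its gradient are illustrative, not from the paper):

```python
import math

# Toy scalar "model": f(x; w) = tanh(w * x), with a single parameter w.
def f(x, w):
    return math.tanh(w * x)

def grad_w(x, w):
    # d/dw tanh(w*x) = x * (1 - tanh(w*x)^2)
    return x * (1.0 - math.tanh(w * x) ** 2)

def linearize(model, grad, w0):
    """First-order Taylor expansion around the pre-trained weight w0:
    f_lin(x; w) = f(x; w0) + (w - w0) * df/dw(x; w0)."""
    def f_lin(x, w):
        return model(x, w0) + (w - w0) * grad(x, w0)
    return f_lin

w0 = 0.5                        # pre-trained weight
f_lin = linearize(f, grad_w, w0)

# At w = w0 the linearized and non-linear models coincide exactly;
# near w0 they agree up to O((w - w0)^2).
print(f_lin(1.0, w0), f(1.0, w0))
```

In the paper's setting, fine-tuning this linearized model (rather than the non-linear one) is what confines the task vectors to the tangent space at the pre-trained checkpoint.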
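The α selection described in the Dataset Splits row amounts to a grid search over a single scaling coefficient applied to the (summed) task vectors. A minimal sketch, with illustrative names (`apply_task_vectors`, `select_alpha`, and the `evaluate` callback are not from the paper's codebase):

```python
def apply_task_vectors(theta_0, taus, alpha):
    # theta = theta_0 + alpha * sum_t tau_t, i.e. Eq. (1) with a single
    # shared coefficient alpha_1 = ... = alpha_T = alpha.
    return [p0 + alpha * sum(tau[i] for tau in taus)
            for i, p0 in enumerate(theta_0)]

def select_alpha(theta_0, taus, evaluate):
    """Pick alpha from {0.0, 0.05, ..., 1.0} that maximizes the target
    metric returned by `evaluate` on held-out data."""
    grid = [round(0.05 * k, 2) for k in range(21)]
    best_alpha, best_score = None, float("-inf")
    for alpha in grid:
        score = evaluate(apply_task_vectors(theta_0, taus, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

# Toy usage: one parameter, one task vector, a held-out metric that
# happens to peak at alpha = 0.4.
best, _ = select_alpha([0.0], [[1.0]], lambda th: -(th[0] - 0.4) ** 2)
print(best)
```

In practice `evaluate` would run the edited model on the small held-out proportion of the training set mentioned in the table.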
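The learning-rate schedule in the Experiment Setup row (cosine annealing over 2,000 iterations with 200 warm-up steps and a peak rate of 10⁻⁵) can be sketched as follows. Linear warm-up is an assumption here; the paper states the number of warm-up steps but not the warm-up shape:

```python
import math

def lr_at(step, total_steps=2000, warmup=200, peak_lr=1e-5):
    """Cosine annealing with linear warm-up (warm-up shape assumed)."""
    if step < warmup:
        return peak_lr * step / warmup          # ramp linearly to the peak
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# The rate peaks at the end of warm-up and decays to ~0 by the last step.
print(lr_at(200), lr_at(2000))
```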