Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

Authors: Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labelled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a parameter-efficient fine-tuning method, particularly with less data, and demonstrate that it can be easily scaled up for higher performance. (A hedged sketch of this blockwise composition is given after the table.)
Researcher Affiliation | Academia | Australian Institute for Machine Learning, The University of Adelaide, {firstname.lastname}@adelaide.edu.au
Pseudocode | No | The paper describes the algorithm using mathematical equations (e.g., Eqs. 4 and 5) but does not present it in a pseudocode block or algorithm box.
Open Source Code | Yes | https://github.com/fredzzhang/atlas
Open Datasets | Yes | We acquire task vectors by fine-tuning CLIP [47] on a variety of 22 image recognition datasets: (1) Stanford Cars [30], (2) DTD [11], (3) EuroSAT [20], (4) GTSRB [56], (5) MNIST [32], (6) RESISC45 [10], (7) SUN397 [63], (8) SVHN [41], (9) CIFAR10 [31], (10) CIFAR100 [31], (11) ImageNet [52], (12) STL10 [12], (13) Food101 [5], (14) Caltech101 [34], (15) Caltech256 [17], (16) FGVCAircraft [39], (17) Flowers102 [42], (18) Oxford Pets [45], (19) CUB200 [61], (20) Pascal VOC [15], (21) Country211 [47], and (22) UCF101 [55]. Fine-tuning was conducted using the AdamW optimiser [38], with a learning rate of 10^-5, batch size of 128 and weight decay of 0.1. Details of the datasets, additional dataset-specific hyper-parameters, and the accuracy after fine-tuning for an assortment of backbones are shown in Table 5. (See the task-vector sketch after the table.)
Dataset Splits | Yes | Table 5: Details of the 22 image classification datasets used in experiments, the number of epochs for fine-tuning and the final accuracy for different backbones of the CLIP model. The table lists train, val and test splits.
Hardware Specification | Yes | Although we do not report compute requirements in the paper, experiments were run on a single A100 or 4090 GPU, except for the ViT-L/14 experiments, which were performed on 2 A100s.
Software Dependencies | No | The paper mentions the AdamW optimiser [38], the LoRA-Torch [36] library and the nevergrad [49] library, but does not provide specific version numbers for these or for other key software components such as PyTorch (implied by a reference to a PyTorch memory function, but no version is given).
Experiment Setup | Yes | Fine-tuning was conducted using the AdamW optimiser [38], with a learning rate of 10^-5, batch size of 128 and weight decay of 0.1.
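
The hyper-parameters quoted in the experiment-setup row correspond to a standard PyTorch training loop. The sketch below is only an illustration of that recipe (AdamW, learning rate 10^-5, batch size 128, weight decay 0.1); the finetune helper, its arguments and the cross-entropy loss are assumptions rather than the authors' code, and the number of epochs is dataset-specific (Table 5 of the paper).

```python
import torch
from torch.utils.data import DataLoader, Dataset


def finetune(model: torch.nn.Module, train_set: Dataset, num_epochs: int) -> torch.nn.Module:
    """Fine-tune a classifier with the recipe quoted in the paper:
    AdamW, learning rate 1e-5, batch size 128, weight decay 0.1."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(num_epochs):  # dataset-specific epoch counts are listed in Table 5
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```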
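
The open-datasets row states that task vectors are obtained by fine-tuning CLIP on each of the 22 datasets. A task vector is conventionally the element-wise difference between the fine-tuned and the pretrained weights; the sketch below assumes that convention, and the extract_task_vector helper and checkpoint paths are hypothetical, not the authors' file layout.

```python
import torch


def extract_task_vector(pretrained_ckpt: str, finetuned_ckpt: str) -> dict:
    """Build a task vector as the difference between a fine-tuned checkpoint
    and the pretrained (zero-shot) checkpoint it was initialised from."""
    theta_0 = torch.load(pretrained_ckpt, map_location="cpu")
    theta_t = torch.load(finetuned_ckpt, map_location="cpu")
    return {
        name: theta_t[name] - theta_0[name]
        for name in theta_0
        if theta_0[name].dtype.is_floating_point  # skip integer buffers
    }


# Hypothetical usage: one fine-tuned checkpoint per dataset, e.g. MNIST.
# tau_mnist = extract_task_vector("clip_vit_b32_zeroshot.pt", "clip_vit_b32_mnist.pt")
```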
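
Finally, the abstract and pseudocode rows note that the composition with learned anisotropic scaling is described only through equations (Eqs. 4 and 5 are not reproduced here). The following is a minimal sketch of that kind of composition, assuming the standard task-arithmetic form θ = θ₀ + Σ_n λ_n ⊙ τ_n with one learnable coefficient per parameter block per task vector; all names (compose, init_coeffs, the treatment of each state-dict tensor as a block) are illustrative rather than taken from the released code.

```python
import torch


def compose(pretrained: dict, task_vectors: list[dict], coeffs: torch.nn.ParameterList) -> dict:
    """Compose model weights from a pretrained checkpoint and task vectors.

    pretrained:   {block_name: tensor} state dict of the zero-shot model (theta_0)
    task_vectors: list of {block_name: tensor} differences (theta_t - theta_0)
    coeffs:       one learnable vector per task vector, with one entry per block
                  (anisotropic scaling); isotropic task arithmetic would instead
                  use a single scalar per task vector.
    """
    composed = {}
    for b, name in enumerate(pretrained):
        composed[name] = pretrained[name] + sum(
            coeffs[t][b] * tv[name] for t, tv in enumerate(task_vectors)
        )
    return composed


def init_coeffs(num_blocks: int, num_tasks: int) -> torch.nn.ParameterList:
    # One coefficient per (task vector, parameter block) pair; zero-initialised
    # here purely for illustration.
    return torch.nn.ParameterList(
        torch.nn.Parameter(torch.zeros(num_blocks)) for _ in range(num_tasks)
    )
```

Only the scaling coefficients are trainable in this sketch, which matches the parameter-efficient framing in the abstract; in practice the composed weights would be applied functionally (e.g. via torch.func.functional_call) so that gradients flow back to the coefficients alone.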