Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Editing models with task arithmetic
Authors: Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models. |
| Researcher Affiliation | Collaboration | ¹University of Washington, ²Microsoft Research, ³Allen Institute for AI |
| Pseudocode | No | The paper describes its methods in prose and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/mlfoundations/task_vectors. |
| Open Datasets | Yes | For image classification, we use CLIP models [78] and task vectors from eight tasks studied by Ilharco et al. [39]; Radford et al. [78], ranging from satellite imagery recognition to classifying traffic signs: Cars [47], DTD [12], EuroSAT [36], GTSRB [87], MNIST [51], RESISC45 [10], SUN397 [101], and SVHN [72]. For the control task, we use ImageNet [16]. |
| Dataset Splits | Yes | For all operations, the model weights are obtained by applying $\theta_{\text{new}} = \theta + \lambda \tau_{\text{new}}$, where the scaling term $\lambda$ is determined using held-out validation sets. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'GPT-2 models [77] from Hugging Face transformers library [97]' and refers to 'PyTorch' [75] and 'AdamW optimizer [58; 75]', but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We fine-tune for 2000 iterations with a batch size of 128, learning rate 1e-5 and a cosine annealing learning rate schedule with 200 warm-up steps and the AdamW optimizer [58; 75], with weight decay 0.1. |
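
The Dataset Splits row quotes the update rule θnew = θ + λτnew, with λ chosen on a held-out validation set. The following is a minimal sketch of that rule on toy weights, not the authors' implementation (see the linked repository for that). The names `apply_task_vector`, `select_lambda`, and `validation_score`, the candidate grid for λ, and the toy scoring function are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def apply_task_vector(theta: Weights, tau: Weights, lam: float) -> Weights:
    """Return theta_new = theta + lam * tau, element-wise per parameter."""
    return {
        name: [w + lam * t for w, t in zip(theta[name], tau[name])]
        for name in theta
    }


def select_lambda(
    theta: Weights,
    tau: Weights,
    validation_score: Callable[[Weights], float],
    candidates: Sequence[float],
) -> float:
    """Pick the scaling term with the highest held-out validation score.

    The candidate grid is an assumption of this sketch; ties go to the
    first candidate in the sequence.
    """
    return max(
        candidates,
        key=lambda lam: validation_score(apply_task_vector(theta, tau, lam)),
    )


# Toy example: the "validation score" is the negative squared distance
# to a made-up target, standing in for accuracy on a validation split.
theta = {"w": [0.0, 1.0]}
tau = {"w": [1.0, -1.0]}
target = [0.5, 0.5]


def toy_score(weights: Weights) -> float:
    return -sum((w - t) ** 2 for w, t in zip(weights["w"], target))


best_lam = select_lambda(theta, tau, toy_score, [0.0, 0.25, 0.5, 0.75, 1.0])
edited = apply_task_vector(theta, tau, best_lam)
```

In a real setting `validation_score` would evaluate the edited model on the held-out split, and the weights would be framework tensors rather than Python lists; the selection logic stays the same.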