Patching open-vocabulary models by interpolating weights

Authors: Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, Ludwig Schmidt

NeurIPS 2022

Reproducibility assessment. Each variable below is paired with its result and the supporting LLM response; verbatim excerpts from the paper appear in quotes.
Research Type: Experimental
LLM Response: "We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that uses interpolations between the weights of a model before fine-tuning and the weights after fine-tuning on a task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks and improves with model scale."
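The core operation behind PAINT is a single linear interpolation in weight space, θ_patch = (1 - α)·θ_zs + α·θ_ft. The sketch below shows that step with PyTorch state dicts; the checkpoint filenames and the choice α = 0.5 are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of PAINT-style weight interpolation, assuming both
# checkpoints are PyTorch state dicts with identical, floating-point keys.
import torch

def interpolate_weights(zeroshot_sd, finetuned_sd, alpha):
    """Return (1 - alpha) * theta_zeroshot + alpha * theta_finetuned, per key."""
    assert zeroshot_sd.keys() == finetuned_sd.keys(), "checkpoints must match"
    return {
        key: (1 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
        for key in zeroshot_sd
    }

# Hypothetical checkpoint paths, for illustration only.
zeroshot_sd = torch.load("clip_zeroshot.pt", map_location="cpu")
finetuned_sd = torch.load("clip_finetuned_mnist.pt", map_location="cpu")
patched_sd = interpolate_weights(zeroshot_sd, finetuned_sd, alpha=0.5)
```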
Researcher Affiliation: Collaboration
LLM Response: Gabriel Ilharco¹, Mitchell Wortsman¹, Samir Yitzhak Gadre², Shuran Song², Hannaneh Hajishirzi¹,³, Simon Kornblith⁴, Ali Farhadi¹, Ludwig Schmidt¹,³ (¹University of Washington, ²Columbia University, ³AI2, ⁴Google Research, Brain Team)
Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes
LLM Response: "Code available at https://github.com/mlfoundations/patching."
Open Datasets: Yes
LLM Response: "Tasks. We consider a diverse set of image classification tasks from Radford et al. [57]. In most experiments, we use ImageNet [14] as a representative supported task, although we explore other supported tasks in Section 4.2. We categorize tasks into patching tasks or supported tasks based on the accuracy difference between the zero-shot model and a model specialized to the task. Specifically, we consider a subset of tasks from Radford et al. [57], categorizing tasks where the linear probes outperform the zero-shot model by over 10 percentage points as patching tasks: Cars [35], DTD [11], EuroSAT [25], GTSRB [71], KITTI [22], MNIST [39], RESISC45 [7], SUN397 [84], and SVHN [53]."
Dataset Splits: Yes
LLM Response: "The mixing coefficient is determined via held-out validation sets for D_supp and D_patch. We refer to the resulting model as θ_patch. Unless mentioned otherwise, we pick the mixing coefficient α that optimizes average accuracy on the held-out validation sets from the supported and patching tasks."
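In practice, that selection reduces to a grid search over α, scoring each interpolated model by the average of its held-out accuracies on the supported and patching tasks. Below is a sketch under stated assumptions: `interpolate_weights` comes from the earlier snippet, the DataLoaders hold the held-out validation splits, and the 0.0 to 1.0 grid in steps of 0.1 is assumed, since this excerpt does not specify the candidate set.

```python
import torch

@torch.no_grad()
def accuracy(model, loader):
    """Top-1 accuracy over a held-out validation DataLoader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def select_alpha(model, zeroshot_sd, finetuned_sd, supp_loader, patch_loader):
    """Pick the mixing coefficient maximizing average held-out accuracy."""
    best_alpha, best_score = 0.0, -1.0
    for alpha in [i / 10 for i in range(11)]:  # assumed candidate grid
        model.load_state_dict(
            interpolate_weights(zeroshot_sd, finetuned_sd, alpha))
        score = 0.5 * (accuracy(model, supp_loader)
                       + accuracy(model, patch_loader))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```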
Hardware Specification: No
LLM Response: The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments.
Software Dependencies: No
LLM Response: The paper mentions using CLIP and the AdamW optimizer but does not specify version numbers for any software libraries or dependencies. It references PyTorch but states neither its direct use nor a version.
Experiment Setup: Yes
LLM Response: "Unless otherwise mentioned, we fine-tune with a batch size of 128 for 2000 iterations, using learning rate 1e-5 with 200 warm-up steps, a cosine annealing learning rate schedule, and the AdamW optimizer [43, 55] (weight decay 0.1)."
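Read literally, that setup maps onto a standard PyTorch training loop, sketched below. The linear-warmup-then-cosine schedule is one common realization of "200 warm-up steps with a cosine annealing schedule", and `model` and `train_loader` (batch size 128) are placeholders, not identifiers from the paper's code.

```python
import math
import torch
import torch.nn.functional as F

TOTAL_STEPS, WARMUP_STEPS = 2000, 200

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

def warmup_cosine(step):
    # Linear warmup for the first 200 steps, then cosine decay to zero.
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)

model.train()
for step, (images, labels) in enumerate(train_loader):
    if step >= TOTAL_STEPS:
        break
    loss = F.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```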