Learning Where to Edit Vision Transformers
Authors: Yunqiao Yang, Long-Kai Huang, Shengzhuang Chen, Kede Ma, Ying Wei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our method, we construct an editing benchmark that introduces subpopulation shifts towards natural underrepresented images and AI-generated images, thereby revealing the limitations of pre-trained ViTs for object recognition. Our approach not only achieves superior performance on the proposed benchmark but also allows for adjustable trade-offs between generalization and locality. |
| Researcher Affiliation | Collaboration | Yunqiao Yang¹, Long-Kai Huang², Shengzhuang Chen¹, Kede Ma¹, Ying Wei³ — ¹City University of Hong Kong, ²Tencent AI Lab, ³Zhejiang University |
| Pseudocode | Yes | Algorithm 1 presents the pseudo-code of our method. |
| Open Source Code | Yes | Our code is available at https://github.com/hustyyq/Where-to-Edit. |
| Open Datasets | Yes | To build the natural image subset, we first compile a large dataset of unlabeled images, denoted as U, from Flickr, by leveraging keywords relevant to the object categories in ImageNet-1k [10]. We adopt Textual Inversion [56] and PUG [5] to construct the AI-generated image subset, encompassing the oil painting and stage light shifts, respectively. |
| Dataset Splits | Yes | Using the validation set from ImageNet-1k as Dl does not adequately examine locality, as the majority are easy samples that lie far from the decision boundary [16]. To more closely examine the adverse effects of model editing, we have carefully curated 2,071 images near the decision boundary of the base model from the validation sets of ImageNet-1k [47], ImageNet-R [25], and ImageNet-Sketch [57], whose predictions are more susceptible to change. |
| Hardware Specification | Yes | Training a hypernetwork for the base ViT/B-16 takes approximately 9 hours on a single RTX A6000 GPU (48 GB). |
| Software Dependencies | No | The paper mentions optimizers like Adam and RMSProp and references their theoretical basis, but does not provide specific version numbers for these or other key software components or libraries used in the implementation. |
| Experiment Setup | Yes | We set the learning rate in the inner loop as 0.001, and perform gradient descent for five steps (i.e., T = 5). In the outer loop, we apply the Adam optimizer with a learning rate of 0.1 to optimize m from random initialization for a total of ten steps. For the hypernetwork optimization, RMSProp is utilized with a learning rate of 1e-4, a minibatch size of eight, and a maximum iteration number of 7,000. (A hedged sketch of these settings follows the table.) |
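
The experiment-setup row above describes a bi-level scheme: an inner loop that fine-tunes masked parameters, an outer loop that optimizes the mask m, and a meta-training phase for the hypernetwork. The following is a minimal PyTorch-style sketch of those reported settings; only the numeric hyperparameters are taken from the paper, while names such as `inner_loop`, `edit_loss_fn`, `outer_loss_fn`, and `hnet` are illustrative placeholders rather than the authors' released implementation (see their repository for the actual code).

```python
# Hedged sketch of the reported hyperparameters; placeholder callables stand in
# for the paper's editing loss and generalization/locality objective.
import torch

INNER_LR = 1e-3     # inner-loop learning rate
INNER_STEPS = 5     # T = 5 gradient-descent steps
OUTER_LR = 0.1      # Adam learning rate for the mask m
OUTER_STEPS = 10    # outer-loop steps
HNET_LR = 1e-4      # RMSProp learning rate for the hypernetwork
BATCH_SIZE = 8      # minibatch size for hypernetwork training
MAX_ITERS = 7_000   # maximum number of hypernetwork iterations


def inner_loop(params, mask, edit_loss_fn):
    """Fine-tune only the mask-selected parameters for T steps (inner loop)."""
    for _ in range(INNER_STEPS):
        loss = edit_loss_fn(params)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Gradient descent restricted to entries selected by the (soft) mask.
        params = [p - INNER_LR * m * g for p, m, g in zip(params, mask, grads)]
    return params


def learn_mask(params, shapes, edit_loss_fn, outer_loss_fn):
    """Outer loop: optimize the mask m from random initialization with Adam."""
    m = [torch.rand(s, requires_grad=True) for s in shapes]
    opt = torch.optim.Adam(m, lr=OUTER_LR)
    for _ in range(OUTER_STEPS):
        edited = inner_loop(params, m, edit_loss_fn)
        loss = outer_loss_fn(edited)  # placeholder generalization/locality objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return m


def train_hypernetwork(hnet, loader, params, edit_loss_fn, outer_loss_fn):
    """Meta-train the hypernetwork with RMSProp on minibatches of eight."""
    opt = torch.optim.RMSprop(hnet.parameters(), lr=HNET_LR)
    for _, batch in zip(range(MAX_ITERS), loader):  # loader yields batches of size 8
        mask = hnet(batch)                          # predicted (soft) mask over params
        edited = inner_loop(params, mask, lambda p: edit_loss_fn(p, batch))
        loss = outer_loss_fn(edited)
        opt.zero_grad()
        loss.backward()
        opt.step()
```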