Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Authors: Yifei Ming, Yixuan Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations. |
| Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin–Madison. Correspondence to: Yifei Ming <ming5@wisc.edu>, Yixuan Li <sharonli@cs.wisc.edu>. |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a third-party tool: "We use the clip-retrieval tool for efficient retrieval from LAION-5B. https://github.com/rom1504/clip-retrieval". However, it does not state that the authors' own implementation code is open-source or provide a link to it. (A minimal sketch of querying this tool appears after the table.) |
| Open Datasets | Yes | Datasets. Following prior works (Zhang et al., 2022a), we consider a wide range of real-world datasets that span both common and finer-grained categories: Caltech101 (Fei-Fei et al., 2004), Birds200 (Wah et al., 2011), Food101 (Bossard et al., 2014), Oxford Pets (Parkhi et al., 2012), Flowers102 (Nilsback & Zisserman, 2008), Textures (Cimpoi et al., 2014), and UCF101 (Soomro et al., 2012). |
| Dataset Splits | Yes | For each target dataset, the train, validation, and test splits also follow (Zhang et al., 2022a). ... The ensemble weights of two logits α, γ are tuned on the validation set. (See the logit-ensemble sketch after the table.) |
| Hardware Specification | Yes | We run all experiments on NVIDIA GeForce RTX-A6000 GPU. |
| Software Dependencies | No | The paper mentions "PyTorch 1.12" but does not list multiple key software components with their versions, nor does it provide a specific version number for the "clip-retrieval tool" mentioned. |
| Experiment Setup | Yes | We vary the number of retrieved samples per class K ∈ {1, 2, 4, 8, 16}. For adaptation, we use pre-trained CLIP with RN50 backbone as the default. ... We use 8 seed images per class as the query set. ... We use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a cosine scheduler. The initial learning rate is set as 0.001 and we finetune for 20 epochs. The hyperparameters such as α, ω, γ are determined based on the validation split of each target dataset. (See the training-configuration sketch after the table.) |
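
Although the authors' own retrieval code is not released, the clip-retrieval tool they cite ships a small Python client. Below is a minimal sketch of cross-modal (text-to-image) and uni-modal (image-to-image) queries against a LAION-5B index; the backend URL, index name, and file paths are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical retrieval sketch using the clip-retrieval client
# (https://github.com/rom1504/clip-retrieval). The backend URL and index
# name below are assumptions; the paper does not specify which index was queried.
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # public demo backend (assumed)
    indice_name="laion5B-L-14",              # assumed index name
    modality=Modality.IMAGE,
    num_images=16,                           # K retrieved samples per query
)

# Cross-modal (T2I) retrieval: query the index with a class-name prompt.
t2i_results = client.query(text="a photo of a golden retriever")

# Uni-modal (I2I) retrieval: query with a seed image from the target task
# (hypothetical local path).
i2i_results = client.query(image="seed_images/golden_retriever_0.jpg")

for r in t2i_results[:3]:
    print(r["url"], r["similarity"])
```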
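The paper highlights logit ensembling of the zero-shot CLIP head and the adapted model, with weights α and γ tuned on the validation split. A minimal sketch of that ensemble with a plain grid search follows, assuming precomputed validation logits; the actual tuning procedure and search grid are not specified in the excerpt.

```python
import torch

def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Top-1 accuracy of class logits against integer labels."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()

def ensemble_logits(zs_logits, adapted_logits, alpha, gamma):
    """Weighted logit ensemble of zero-shot CLIP and the adapted model."""
    return alpha * zs_logits + gamma * adapted_logits

# Placeholder validation tensors; in practice these come from CLIP's
# zero-shot head and the finetuned adapter on the validation split.
val_zs = torch.randn(512, 101)
val_adapted = torch.randn(512, 101)
val_labels = torch.randint(0, 101, (512,))

grid = [x / 20 for x in range(21)]  # assumed 0.05-spaced search grid
alpha, gamma = max(
    ((a, g) for a in grid for g in grid),
    key=lambda ag: accuracy(ensemble_logits(val_zs, val_adapted, *ag), val_labels),
)
```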
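The reported optimization setup (AdamW, cosine scheduler, initial learning rate 0.001, 20 finetuning epochs) maps directly onto standard PyTorch primitives. A sketch follows, with the model and data loop left as placeholders; the CLIP RN50 backbone and the retrieved training set are omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the adaptation head; the paper finetunes on top of a
# pre-trained CLIP RN50 backbone, which is not reproduced here.
model = torch.nn.Linear(1024, 101)

optimizer = AdamW(model.parameters(), lr=1e-3)      # initial LR 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=20)  # cosine decay over 20 epochs

for epoch in range(20):
    # ... one pass over the retrieved training set goes here ...
    scheduler.step()
```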