Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models

Authors: Yifei Ming, Yixuan Li

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations. (A sketch of the logit-ensemble step appears after the table.)
Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin-Madison. Correspondence to: Yifei Ming <ming5@wisc.edu>, Yixuan Li <sharonli@cs.wisc.edu>.
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using a third-party tool: "We use the clip-retrieval tool for efficient retrieval from LAION-5B" (https://github.com/rom1504/clip-retrieval). However, it does not state that the authors' own implementation code is open source, nor does it provide a link to such code. (A sketch of retrieval with this tool appears after the table.)
Open Datasets | Yes | Datasets. Following prior works (Zhang et al., 2022a), we consider a wide range of real-world datasets that span both common and finer-grained categories: Caltech101 (Fei-Fei et al., 2004), Birds200 (Wah et al., 2011), Food101 (Bossard et al., 2014), Oxford Pets (Parkhi et al., 2012), Flowers102 (Nilsback & Zisserman, 2008), Textures (Cimpoi et al., 2014), and UCF101 (Soomro et al., 2012).
Dataset Splits | Yes | For each target dataset, the train, validation, and test splits also follow (Zhang et al., 2022a). ... The ensemble weights of the two logits α, γ are tuned on the validation set. (A sketch of tuning these weights on the validation split appears after the table.)
Hardware Specification | Yes | We run all experiments on NVIDIA GeForce RTX A6000 GPU.
Software Dependencies | No | The paper mentions "PyTorch 1.12" but does not list multiple key software components with their versions, nor does it provide a specific version number for the "clip-retrieval" tool mentioned.
Experiment Setup | Yes | We vary the number of retrieved samples per class K ∈ {1, 2, 4, 8, 16}. For adaptation, we use pre-trained CLIP with RN50 backbone as the default. ... We use 8 seed images per class as the query set. ... We use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a cosine scheduler. The initial learning rate is set as 0.001 and we finetune for 20 epochs. The hyperparameters such as α, ω, γ are determined based on the validation split of each target dataset. (A sketch of this optimizer and schedule configuration appears after the table.)
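
The logit ensemble highlighted in the Research Type row combines zero-shot CLIP logits with the logits of the retrieval-trained adapter, using the weights α and γ noted in the Dataset Splits row. A minimal sketch, assuming a weighted sum and hypothetical tensor names and shapes (the paper's exact ensemble form may differ):

```python
import torch

def ensemble_logits(zero_shot_logits: torch.Tensor,
                    adapter_logits: torch.Tensor,
                    alpha: float, gamma: float) -> torch.Tensor:
    # Assumed weighted combination of the two logits; sketch only.
    return alpha * zero_shot_logits + gamma * adapter_logits

# Hypothetical shapes: a batch of 4 images, 101 classes (e.g., Caltech101).
zero_shot = torch.randn(4, 101)  # logits from frozen zero-shot CLIP
adapter = torch.randn(4, 101)    # logits from the retrieval-trained adapter
predictions = ensemble_logits(zero_shot, adapter, alpha=0.6, gamma=0.4).argmax(dim=-1)
```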
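For the retrieval step referenced in the Open Source Code row, the clip-retrieval repository provides a Python client. The sketch below assumes the ClipClient interface, hosted index name, and result fields described in that repository's README; none of these details are stated in the paper itself:

```python
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # hosted LAION index (assumed)
    indice_name="laion5B-L-14",              # assumed index name
    num_images=16,                           # K retrieved samples per query
)

# Cross-modal (text-to-image) retrieval with a class-name query.
text_results = client.query(text="a photo of a golden retriever")

# Uni-modal (image-to-image) retrieval with a seed image as the query.
image_results = client.query(image="seed_images/golden_retriever_0.jpg")

for r in text_results[:3]:
    print(r.get("url"), r.get("similarity"))
```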
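The Dataset Splits row notes that the ensemble weights α, γ are tuned on the validation set. A minimal grid-search sketch of that tuning, with placeholder validation tensors and a hypothetical search range:

```python
import itertools
import torch

def ensemble_accuracy(zs_logits, ad_logits, labels, alpha, gamma):
    # Accuracy of the (assumed) weighted logit ensemble on held-out data.
    preds = (alpha * zs_logits + gamma * ad_logits).argmax(dim=-1)
    return (preds == labels).float().mean().item()

# Placeholder tensors; in practice these come from the validation split.
zs_logits, ad_logits = torch.randn(100, 101), torch.randn(100, 101)
labels = torch.randint(0, 101, (100,))

grid = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical search range
best_alpha, best_gamma, best_acc = max(
    ((a, g, ensemble_accuracy(zs_logits, ad_logits, labels, a, g))
     for a, g in itertools.product(grid, grid)),
    key=lambda t: t[2],
)
print(f"alpha={best_alpha}, gamma={best_gamma}, val acc={best_acc:.3f}")
```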
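The Experiment Setup row maps directly onto standard PyTorch components (AdamW with a cosine schedule, learning rate 0.001, 20 epochs). A sketch of that configuration, with the adapter itself left as a hypothetical placeholder:

```python
import torch
from torch import nn

# Placeholder adapter head: CLIP RN50 image features (1024-d) -> class logits.
# The paper's actual adapter architecture is not reproduced here.
adapter = nn.Linear(1024, 101)

epochs = 20
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the retrieved and few-shot training data goes here ...
    scheduler.step()
```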