Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments along with promising empirical results demonstrate the effectiveness of RLCF.
Researcher Affiliation | Collaboration | Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang. ReLER Lab, AAII, University of Technology Sydney; ReLER Lab, CCAI, Zhejiang University; Baidu Inc. {zhaoshuaimcc, wxh1996111}@gmail.com; {zhulinchao, yangyics}@zju.edu.cn. Part of this work was done during an internship at Baidu Inc.
Pseudocode | No | The paper describes the steps of the reinforcement learning process and various task-specific pipelines, but it does not present them in a formal pseudocode block or a clearly labeled algorithm format.
Open Source Code | Yes | The code is available at https://github.com/mzhaoshuai/RLCF.
Open Datasets | Yes | Datasets: Following CLIP and TPT, we test RLCF on ImageNet (Deng et al., 2009) and its four variant test sets with distribution shifts: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). ... we train the captioning model on MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and validation set of NoCaps (Agrawal et al., 2019).
Dataset Splits | Yes | To test the adaptation ability of RLCF for captioning models in a zero-shot or cross-domain condition, we train the captioning model on MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and validation set of NoCaps (Agrawal et al., 2019).
Hardware Specification | Yes | Test on ImageNet-A and ImageNet-V2 with a single NVIDIA 40GB A100 GPU.
Software Dependencies | No | The paper mentions the AdamW optimizer, but does not provide specific version numbers for any software libraries (e.g., PyTorch, TensorFlow), frameworks, or programming languages (e.g., Python version) used in the experiments.
Experiment Setup | Yes | For prompt tuning, the learning rate is 7e-3, the weight decay value is 5e-4, and the optimizer is AdamW (Loshchilov & Hutter, 2019). For image encoder tuning, the learning rate is decreased to 1e-5. Given a test sample, the parameters will be optimized for 3 steps to maximize the reward of the top-3 (sampling factor K = 3) predictions. The momentum coefficient m = 0.9998 and update interval Bs = 64 for the momentum buffer.
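
The experiment setup in the last row can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration of per-sample test-time adaptation driven by a CLIP reward, using a generic REINFORCE-style surrogate loss; `model`, `prompt`, `reward_fn`, `adapt_on_sample`, and `momentum_buffer_update` are placeholder names rather than the authors' RLCF code, and only the reported hyperparameters (learning rates, weight decay, number of steps, K, m, Bs) come from the paper.

```python
# Minimal, hypothetical sketch of the reported setup (not the authors' RLCF code).
# Real APIs used: torch, torch.optim.AdamW. Placeholders: model, prompt, reward_fn.
import torch
from torch.optim import AdamW

K = 3        # sampling factor: reward the top-3 predictions (from the paper)
STEPS = 3    # optimization steps per test sample (from the paper)
M = 0.9998   # momentum coefficient of the momentum buffer (from the paper)
BS = 64      # update interval of the momentum buffer, in test samples (from the paper)

def adapt_on_sample(model, prompt, reward_fn, image, tune_image_encoder=False):
    """Adapt the tunable parameters on one test image by maximizing a CLIP reward."""
    lr = 1e-5 if tune_image_encoder else 7e-3           # reported learning rates
    optimizer = AdamW([prompt], lr=lr, weight_decay=5e-4)
    for _ in range(STEPS):
        logits = model(image, prompt)                    # [1, num_classes] class logits
        topk = logits.topk(K, dim=-1).indices            # candidate predictions to reward
        reward = reward_fn(image, topk)                  # [1, K] CLIP-score rewards (placeholder)
        log_prob = torch.log_softmax(logits, dim=-1).gather(-1, topk)
        loss = -(log_prob * reward).sum()                # REINFORCE-style surrogate (assumed form)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return logits.argmax(dim=-1)

@torch.no_grad()
def momentum_buffer_update(buffer_params, live_params, m=M):
    """EMA-style buffer update; the paper reports applying it with interval Bs = 64."""
    for b, p in zip(buffer_params, live_params):
        b.mul_(m).add_(p, alpha=1.0 - m)
```

In this reading, the momentum buffer acts as a slowly moving copy of the adapted parameters, a common way to stabilize episodic test-time updates; whether RLCF uses it in exactly this form should be checked against the released code at the repository linked above.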