Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments along with promising empirical results demonstrate the effectiveness of RLCF. |
| Researcher Affiliation | Collaboration | Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang. ReLER Lab, AAII, University of Technology Sydney; ReLER Lab, CCAI, Zhejiang University; Baidu Inc. {zhaoshuaimcc, wxh1996111}@gmail.com, {zhulinchao, yangyics}@zju.edu.cn. Part of this work was done during an internship at Baidu Inc. |
| Pseudocode | No | The paper describes the steps of the reinforcement learning process and various task-specific pipelines, but it does not present them in a formal pseudocode block or a clearly labeled algorithm format. |
| Open Source Code | Yes | The code is available at https://github.com/mzhaoshuai/RLCF. |
| Open Datasets | Yes | Datasets Following CLIP and TPT, we test RLCF on ImageNet (Deng et al., 2009) and its four variant test sets with distribution shifts: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). ... we train the captioning model on the MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and the validation set of NoCaps (Agrawal et al., 2019). |
| Dataset Splits | Yes | To test the adaptation ability of RLCF for captioning models in a zero-shot or cross-domain condition, we train the captioning model on the MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and the validation set of NoCaps (Agrawal et al., 2019). |
| Hardware Specification | Yes | Test on ImageNet-A and ImageNet-V2 with a single NVIDIA 40GB A100 GPU. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, but does not provide specific version numbers for any software libraries (e.g., PyTorch, TensorFlow), frameworks, or programming languages (e.g., Python version) used in the experiments. |
| Experiment Setup | Yes | For prompt tuning, the learning rate is 7e-3, the weight decay value is 5e-4, and the optimizer is AdamW (Loshchilov & Hutter, 2019). For image encoder tuning, the learning rate is decreased to 1e-5. Given a test sample, the parameters will be optimized for 3 steps to maximize the reward of the top-3 (sampling factor K = 3) predictions. The momentum coefficient m = 0.9998 and update interval Bs = 64 for the momentum buffer. |
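
Since the paper describes the reinforcement-learning-with-CLIP-feedback loop only in prose (see the Pseudocode row), the following is a minimal PyTorch-style sketch of how the reported Experiment Setup could be wired together for one test sample. The function and object names (`tta_step_rlcf`, `policy_model`, `reward_model`, `tunable_params`) are illustrative assumptions, not identifiers from the official RLCF repository; only the hyperparameter values (learning rate 7e-3, weight decay 5e-4, K = 3, 3 steps) come from the table above. The momentum buffer (m = 0.9998, Bs = 64) is omitted for brevity.

```python
# Hedged sketch of an RLCF-style test-time adaptation step (not the official
# implementation). Hyperparameters follow the Experiment Setup row above.
import torch
import torch.nn.functional as F


def tta_step_rlcf(policy_model, reward_model, image, class_text_features,
                  tunable_params, k=3, steps=3, lr=7e-3, weight_decay=5e-4):
    """Adapt `tunable_params` (e.g., prompt vectors) on one test image by
    maximizing a CLIP reward over the top-k predictions (REINFORCE-style)."""
    optimizer = torch.optim.AdamW(tunable_params, lr=lr, weight_decay=weight_decay)

    for _ in range(steps):                                   # 3 steps per sample
        logits = policy_model(image, class_text_features)    # (1, num_classes)
        log_probs = F.log_softmax(logits, dim=-1)

        # Take the top-k predictions (sampling factor K = 3 in the paper).
        topk = logits.topk(k, dim=-1).indices.squeeze(0)

        # CLIP reward for each candidate class; detached so the reward acts as
        # a fixed feedback signal rather than a differentiable objective.
        with torch.no_grad():
            rewards = reward_model(image, topk)               # (k,)
            rewards = rewards - rewards.mean()                # baseline-subtracted

        # Policy-gradient surrogate loss: maximize expected reward.
        loss = -(rewards * log_probs[0, topk]).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Detaching the reward and subtracting its mean is standard REINFORCE practice; whether RLCF uses exactly this baseline is an assumption of the sketch.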