Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments along with promising empirical results demonstrate the effectiveness of RLCF. |
| Researcher Affiliation | Collaboration | Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang. ReLER Lab, AAII, University of Technology Sydney; ReLER Lab, CCAI, Zhejiang University; Baidu Inc. Part of this work is done during an internship at Baidu Inc. |
| Pseudocode | No | The paper describes the steps of the reinforcement learning process and various task-specific pipelines, but it does not present them in a formal pseudocode block or a clearly labeled algorithm format. |
| Open Source Code | Yes | The code is available at https://github.com/mzhaoshuai/RLCF. |
| Open Datasets | Yes | Following CLIP and TPT, we test RLCF on ImageNet (Deng et al., 2009) and its four variant test sets with distribution shifts: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). ... we train the captioning model on MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and validation set of NoCaps (Agrawal et al., 2019). |
| Dataset Splits | Yes | To test the adaptation ability of RLCF for captioning models in a zero-shot or cross-domain condition, we train the captioning model on MS-COCO train set (Lin et al., 2014) and test it on the test set of Flickr30K (Plummer et al., 2015) and validation set of NoCaps (Agrawal et al., 2019). |
| Hardware Specification | Yes | Test on ImageNet-A and ImageNet-V2 with a single NVIDIA 40GB A100 GPU. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, but does not provide specific version numbers for any software libraries (e.g., PyTorch, TensorFlow), frameworks, or programming languages (e.g., Python version) used in the experiments. |
| Experiment Setup | Yes | For prompt tuning, the learning rate is 7e-3, the weight decay value is 5e-4, and the optimizer is AdamW (Loshchilov & Hutter, 2019). For image encoder tuning, the learning rate is decreased to 1e-5. Given a test sample, the parameters will be optimized for 3 steps to maximize the reward of the top-3 (sampling factor K = 3) predictions. The momentum coefficient m = 0.9998 and update interval Bs = 64 for the momentum buffer. |
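The momentum-buffer update quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code: it models the buffer as an exponential moving average of the live parameters with the paper's coefficient m = 0.9998; the function name and the toy scalar "parameters" are illustrative assumptions.

```python
# Minimal sketch (not from the RLCF codebase): the momentum buffer that is
# refreshed every Bs = 64 test samples can be modeled as an exponential
# moving average (EMA) of the live parameters with coefficient m = 0.9998.

def ema_update(buffer_params, live_params, m=0.9998):
    """Blend live parameters into the momentum buffer: b <- m*b + (1 - m)*p."""
    return [m * b + (1.0 - m) * p for b, p in zip(buffer_params, live_params)]

# Toy example with scalar "parameters": after 64 updates toward zeroed live
# weights, the buffer has moved only about 1.3% of the way, showing how a
# coefficient this close to 1 keeps the buffer stable during test-time tuning.
buffer = [1.0, 2.0]
live = [0.0, 0.0]
for _ in range(64):
    buffer = ema_update(buffer, live)
```

A coefficient this close to 1 means the buffer acts as a slowly drifting reference copy of the model, which is what makes it usable as a stable anchor while the live parameters are optimized for 3 steps on each test sample.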