Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models
Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China; 2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; 3 School of Artificial Intelligence, University of Chinese Academy of Sciences; 4 Nanjing University. |
| Pseudocode | Yes | Algorithm 1 Collaborative Training for CLIP |
| Open Source Code | Yes | Our code is publicly available at https://github.com/mrflogs/CraFT. |
| Open Datasets | Yes | In accordance with CoOp (Zhou et al., 2022b), we adopt 11 distinct image classification datasets to investigate few-shot learning. These datasets encompass various domains of image classification, including generic object recognition with ImageNet (Deng et al., 2009) and Caltech101 (Li et al., 2004)... |
| Dataset Splits | No | Specifically, we have trained models using 1, 2, 4, 8, and 16 shots and evaluated them on the full test sets. The paper mentions 'training' and 'test' sets but does not explicitly specify a 'validation' split with reproducible details (e.g., percentages, counts, or how it was used to tune hyperparameters). |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions software components like 'CLIP', 'ResNet-50', 'transformer', and 'AdamW optimizer' but does not specify their version numbers. |
| Experiment Setup | Yes | To optimize the text prompts in the prompt generation module, we used the CMA-ES algorithm and set the prompt length to 4. The text prompts are projected into a subspace of dimension 512 using a random matrix sampled from a Gaussian distribution N(0, 0.02). The population size is set to 40, with a budget of 8,000 API calls. For the prediction refinement module, we use a three-layer MLP with a hidden dimension of 512 as the refinement network. We set the hyper-parameters λI and λO to 0.1 divided by the number of classes by default. The prediction refinement module is optimized using the AdamW optimizer with a learning rate of 0.001, and we set the batch size as 256 during training. |
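
The Experiment Setup row above quotes the paper's prompt-generation settings: CMA-ES search over a 512-dimensional subspace, a prompt of length 4 projected by a fixed random matrix drawn from N(0, 0.02), population size 40, and a budget of 8,000 API calls. The sketch below shows how such a derivative-free search could be wired up with the `cma` package; `query_blackbox_clip`, `latent_to_prompt`, and the initial CMA-ES step size are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of black-box prompt search with CMA-ES, assuming a fixed
# random projection from a low-dimensional latent to prompt token embeddings.
import numpy as np
import cma

PROMPT_LEN = 4       # number of learnable prompt tokens (paper setting)
TOKEN_DIM = 512      # CLIP text-embedding dimension
SUBSPACE_DIM = 512   # dimension of the search subspace (paper setting)
POP_SIZE = 40        # CMA-ES population size (paper setting)
BUDGET = 8000        # total number of API calls (paper setting)

# Fixed random projection, sampled from N(0, 0.02) as quoted above.
rng = np.random.default_rng(0)
projection = rng.normal(0.0, 0.02, size=(SUBSPACE_DIM, PROMPT_LEN * TOKEN_DIM))

def latent_to_prompt(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to prompt token embeddings of shape (PROMPT_LEN, TOKEN_DIM)."""
    return (z @ projection).reshape(PROMPT_LEN, TOKEN_DIM)

def query_blackbox_clip(prompt_embeddings: np.ndarray) -> float:
    """Stand-in for the black-box API call: the model owner would return the
    few-shot training loss for these prompt embeddings. A dummy value is
    returned here so the sketch runs end to end."""
    return float(np.sum(prompt_embeddings ** 2))

# CMA-ES over the latent subspace; each generation costs POP_SIZE API calls.
es = cma.CMAEvolutionStrategy(SUBSPACE_DIM * [0.0], 1.0, {"popsize": POP_SIZE})
calls = 0
while calls < BUDGET and not es.stop():
    candidates = es.ask()
    losses = [query_blackbox_clip(latent_to_prompt(np.asarray(z))) for z in candidates]
    es.tell(candidates, losses)
    calls += len(candidates)

best_prompt = latent_to_prompt(np.asarray(es.result.xbest))
```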
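
For the prediction refinement module, the same row reports a three-layer MLP with hidden dimension 512, trained with AdamW at a learning rate of 0.001 and batch size 256. A minimal PyTorch sketch under those settings follows; the residual fusion of the black-box logits, the class count, and the training-step helper are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a prediction-refinement MLP, assuming it consumes the
# black-box model's output logits and refines them in a residual style.
import torch
import torch.nn as nn

class RefinementMLP(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int = 512):
        super().__init__()
        # Three-layer MLP with hidden dimension 512 (paper setting).
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, blackbox_logits: torch.Tensor) -> torch.Tensor:
        # Residual refinement of the black-box predictions (illustrative choice).
        return blackbox_logits + self.net(blackbox_logits)

num_classes = 100  # example value; depends on the dataset
model = RefinementMLP(num_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # paper setting
criterion = nn.CrossEntropyLoss()

def train_step(blackbox_logits: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of cached black-box logits (batch size 256)."""
    optimizer.zero_grad()
    loss = criterion(model(blackbox_logits), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```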