Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models
Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China; 2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; 3 School of Artificial Intelligence, University of Chinese Academy of Sciences; 4 Nanjing University. |
| Pseudocode | Yes | Algorithm 1 Collaborative Training for CLIP |
| Open Source Code | Yes | Our code is publicly available at https://github.com/mrflogs/CraFT. |
| Open Datasets | Yes | In accordance with CoOp (Zhou et al., 2022b), we adopt 11 distinct image classification datasets to investigate few-shot learning. These datasets encompass various domains of image classification, including generic object recognition with ImageNet (Deng et al., 2009) and Caltech101 (Li et al., 2004)... |
| Dataset Splits | No | Specifically, we have trained models using 1, 2, 4, 8, and 16 shots and evaluated them on the full test sets. The paper mentions 'training' and 'test' sets but does not explicitly specify a 'validation' split with reproducible details (e.g., percentages, counts, or how it was used to tune hyperparameters). |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions software components like 'CLIP', 'ResNet-50', 'transformer', and 'AdamW optimizer' but does not specify their version numbers. |
| Experiment Setup | Yes | To optimize the text prompts in the prompt generation module, we used the CMA-ES algorithm and set the prompt length to 4. The text prompts are projected into a subspace of dimension 512 using a random matrix sampled from a Gaussian distribution N(0, 0.02). The population size is set to 40, with a budget of 8,000 API calls. For the prediction refinement module, we use a three-layer MLP with a hidden dimension of 512 as the refinement network. We set the hyper-parameters λI and λO to 0.1 divided by the number of classes by default. The prediction refinement module is optimized using the AdamW optimizer with a learning rate of 0.001, and we set the batch size as 256 during training. |
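
The Experiment Setup row above quotes the paper's prompt-generation settings: CMA-ES search over a 512-dimensional subspace, a prompt of length 4 projected by a fixed random matrix drawn from N(0, 0.02), population size 40, and a budget of 8,000 API calls. The sketch below shows how such a derivative-free search could be wired up with the `cma` package; `query_blackbox_clip`, `latent_to_prompt`, and the initial CMA-ES step size are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of black-box prompt search with CMA-ES, assuming a fixed
# random projection from a low-dimensional latent to prompt token embeddings.
import numpy as np
import cma

PROMPT_LEN = 4       # number of learnable prompt tokens (paper setting)
TOKEN_DIM = 512      # CLIP text-embedding dimension
SUBSPACE_DIM = 512   # dimension of the search subspace (paper setting)
POP_SIZE = 40        # CMA-ES population size (paper setting)
BUDGET = 8000        # total number of API calls (paper setting)

# Fixed random projection, sampled from N(0, 0.02) as quoted above.
rng = np.random.default_rng(0)
projection = rng.normal(0.0, 0.02, size=(SUBSPACE_DIM, PROMPT_LEN * TOKEN_DIM))

def latent_to_prompt(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to prompt token embeddings of shape (PROMPT_LEN, TOKEN_DIM)."""
    return (z @ projection).reshape(PROMPT_LEN, TOKEN_DIM)

def query_blackbox_clip(prompt_embeddings: np.ndarray) -> float:
    """Stand-in for the black-box API call: the model owner would return the
    few-shot training loss for these prompt embeddings. A dummy value is
    returned here so the sketch runs end to end."""
    return float(np.sum(prompt_embeddings ** 2))

# CMA-ES over the latent subspace; each generation costs POP_SIZE API calls.
es = cma.CMAEvolutionStrategy(SUBSPACE_DIM * [0.0], 1.0, {"popsize": POP_SIZE})
calls = 0
while calls < BUDGET and not es.stop():
    candidates = es.ask()
    losses = [query_blackbox_clip(latent_to_prompt(np.asarray(z))) for z in candidates]
    es.tell(candidates, losses)
    calls += len(candidates)

best_prompt = latent_to_prompt(np.asarray(es.result.xbest))
```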
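
For the prediction refinement module, the same row reports a three-layer MLP with hidden dimension 512, trained with AdamW at a learning rate of 0.001 and batch size 256. A minimal PyTorch sketch under those settings follows; the residual fusion of the black-box logits, the class count, and the training-step helper are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a prediction-refinement MLP, assuming it consumes the
# black-box model's output logits and refines them in a residual style.
import torch
import torch.nn as nn

class RefinementMLP(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int = 512):
        super().__init__()
        # Three-layer MLP with hidden dimension 512 (paper setting).
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, blackbox_logits: torch.Tensor) -> torch.Tensor:
        # Residual refinement of the black-box predictions (illustrative choice).
        return blackbox_logits + self.net(blackbox_logits)

num_classes = 100  # example value; depends on the dataset
model = RefinementMLP(num_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # paper setting
criterion = nn.CrossEntropyLoss()

def train_step(blackbox_logits: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of cached black-box logits (batch size 256)."""
    optimizer.zero_grad()
    loss = criterion(model(blackbox_logits), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```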