TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Authors: Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments on 11 visual recognition benchmarks, where our TransAgent achieves the state-of-the-art under the same low-shot transfer setting, e.g., via knowledge collaboration, it outperforms the well-known CoOp [87] with around 10% on average and 20% on EuroSAT which contains large domain shifts.
Researcher Affiliation | Collaboration | 1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Shanghai AI Laboratory; 4 Shanghai Jiao Tong University
Pseudocode | No | The paper describes the proposed framework using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code will be released at https://github.com/markywg/transagent.
Open Datasets | Yes | We evaluate our proposed method on 11 commonly used datasets covering a wide range of recognition tasks, including ImageNet [21], Caltech101 [29], OxfordPets [58], StanfordCars [44], Flowers102 [57], Food101 [6], FGVCAircraft [55], SUN397 [73], UCF101 [67], DTD [19] and EuroSAT [35].
Dataset Splits | Yes | We explore two typical low-shot scenarios to evaluate the performance. (i) Base-to-novel generalization: The datasets are equally split into base and novel classes. The model is trained on base classes and evaluated on the test set of both classes. We report the base and novel class accuracy and the harmonic mean (HM) of the results. (ii) Few-shot classification: We assess the accuracy trained with 1/2/4/8/16 shot(s) per class to examine the model's learning capacity. (See the protocol sketch below the table.)
Hardware Specification | Yes | All experiments are conducted on a single Nvidia A6000 GPU.
Software Dependencies | No | The paper mentions 'CLIP ViT-B/16 as our backbone' but does not specify software dependencies like Python, PyTorch, or CUDA versions with specific version numbers.
Experiment Setup | Yes | The number of learnable vision and language prompt tokens are both set to 4, and the prompt depth is set to 9 for base-to-novel generalization and few-shot classification, and 3 for cross-dataset and domain generalization. The learnable text prompts of the first layer are initialized with the word embeddings of "a photo of a", while the other learnable prompts are randomly initialized with a normal distribution... For few-shot classification, we train the models for 50 epochs under different low-shot settings (ranging from 1 to 16)... all models are trained for 20 epochs using 16-shot samples with a fixed batch size of 4 and a learning rate of 0.0025 with SGD as the optimizer. We set λ1 = 1, λ2 = 25 and λ3 = 1 in Eq. 9 after extensive hyperparameter search to balance the total loss.
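A compact way to read the Experiment Setup row is as a single training configuration. The sketch below is a hedged summary, not the repository's actual config: the dictionary keys are illustrative names, and only the values quoted from the paper are grounded.

```python
# Hedged sketch of the training configuration quoted in the Experiment Setup
# row. Keys are illustrative, not the released code's config names; values
# are taken from the quoted text.
TRANSAGENT_CFG = {
    "backbone": "CLIP ViT-B/16",
    "n_vision_prompt_tokens": 4,
    "n_text_prompt_tokens": 4,
    "prompt_depth": {
        "base_to_novel": 9,
        "few_shot": 9,
        "cross_dataset": 3,
        "domain_generalization": 3,
    },
    "text_prompt_init": "a photo of a",  # first-layer text prompts; deeper prompts random
    "epochs": {"few_shot": 50, "other_settings": 20},  # 20 epochs with 16-shot samples
    "batch_size": 4,
    "optimizer": "SGD",
    "learning_rate": 0.0025,
    "loss_weights": {"lambda1": 1.0, "lambda2": 25.0, "lambda3": 1.0},  # weights in Eq. 9
}
```

Under this reading, the total objective of Eq. 9 combines three loss terms weighted by λ1, λ2 and λ3; the exact form of those terms is not reproduced here.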
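The base-to-novel protocol quoted in the Dataset Splits row can likewise be summarized with two small helpers. This is a minimal sketch under the stated setup (an equal split of the class list, with the harmonic mean as the summary metric), not the authors' evaluation code; the function names are hypothetical.

```python
# Minimal sketch of the base-to-novel evaluation protocol described above:
# classes are split into equal base/novel halves, the model is trained on the
# base classes only, and base accuracy, novel accuracy and their harmonic
# mean (HM) are reported. Function names are illustrative.

def split_base_novel(class_names):
    """Equally split an ordered class list into base and novel halves."""
    mid = (len(class_names) + 1) // 2          # base keeps the extra class if odd
    return class_names[:mid], class_names[mid:]

def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy, the reported summary metric."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Example with hypothetical accuracies: HM(84.0, 76.0) = 79.8
base_classes, novel_classes = split_base_novel([f"class_{i}" for i in range(100)])
print(len(base_classes), len(novel_classes))   # 50 50
print(round(harmonic_mean(84.0, 76.0), 2))     # 79.8
```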