TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Authors: Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on 11 visual recognition benchmarks, where our TransAgent achieves the state-of-the-art under the same low-shot transfer setting, e.g., via knowledge collaboration, it outperforms the well-known CoOp [87] with around 10% on average and 20% on EuroSAT which contains large domain shifts. |
| Researcher Affiliation | Collaboration | 1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3Shanghai AI Laboratory 4Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the proposed framework using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be released at https://github.com/markywg/transagent. |
| Open Datasets | Yes | We evaluate our proposed method on 11 commonly used datasets covering a wide range of recognition tasks, including ImageNet [21], Caltech101 [29], Oxford Pets [58], Stanford Cars [44], Flowers102 [57], Food101 [6], FGVCAircraft [55], SUN397 [73], UCF101 [67], DTD [19] and EuroSAT [35]. |
| Dataset Splits | Yes | We explore two typical low-shot scenarios to evaluate the performance. (i) Base-to-novel generalization: The datasets are equally split into base and novel classes. The model is trained on base classes and evaluated on the test set of both classes. We report the base and novel class accuracy and the harmonic mean (HM) of the results. (ii) Few-shot classification: We assess the accuracy trained with 1/2/4/8/16 shot(s) per class to examine the model's learning capacity. (A hedged sketch of this split and k-shot sampling protocol follows the table.) |
| Hardware Specification | Yes | All experiments are conducted on a single Nvidia A6000 GPU. |
| Software Dependencies | No | The paper mentions 'CLIP ViT-B/16 as our backbone' but does not specify software dependencies such as Python, PyTorch, or CUDA with specific version numbers. |
| Experiment Setup | Yes | The number of learnable vision and language prompt tokens is set to 4 for both, and the prompt depth is set to 9 for base-to-novel generalization and few-shot classification, and 3 for cross-dataset and domain generalization. The learnable text prompts of the first layer are initialized with the word embeddings of "a photo of a", while the other learnable prompts are randomly initialized with a normal distribution... For few-shot classification, we train the models for 50 epochs under different low-shot settings (ranging from 1 to 16)... all models are trained for 20 epochs using 16-shot samples with a fixed batch size of 4 and a learning rate of 0.0025 with SGD as the optimizer. We set λ1 = 1, λ2 = 25 and λ3 = 1 in Eq. 9 after extensive hyperparameter search to balance the total loss. (Hedged sketches of the prompt initialization and training configuration follow the table.) |
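The base-to-novel split and k-shot sampling protocol quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration, not the authors' released code; the `(path, label)` dataset layout, the random seed, and splitting the class list in its given order are assumptions.

```python
# Minimal sketch of the low-shot evaluation protocol described in the paper
# (not taken from the official TransAgent repository).
import random
from collections import defaultdict

def base_novel_split(class_names):
    """Split the class list equally: first half as base classes, second half as novel."""
    mid = len(class_names) // 2
    return class_names[:mid], class_names[mid:]

def sample_k_shots(dataset, k, seed=0):
    """Keep k training samples per class (k in {1, 2, 4, 8, 16}); dataset is (path, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in dataset:
        by_class[label].append((path, label))
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset

def harmonic_mean(acc_base, acc_novel):
    """Harmonic mean (HM) of base- and novel-class accuracy, as reported in the paper."""
    return 2 * acc_base * acc_novel / (acc_base + acc_novel)
```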
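Likewise, the prompt configuration quoted in the Experiment Setup row (4 vision/language prompt tokens, prompt depth 9 or 3, first-layer text prompts initialized from the word embeddings of "a photo of a", the rest drawn from a normal distribution) might look like the following. The embedding widths, the 0.02 standard deviation, and the precomputed `phrase_embedding` argument are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the prompt initialization; details may differ from the released code.
import torch
import torch.nn as nn

N_TOKENS = 4          # learnable vision and language prompt tokens per layer
PROMPT_DEPTH = 9      # 9 for base-to-novel / few-shot; 3 for cross-dataset / domain generalization
TEXT_DIM = 512        # assumed CLIP ViT-B/16 text width
VISION_DIM = 768      # assumed CLIP ViT-B/16 vision width

def init_prompts(phrase_embedding):
    """phrase_embedding: (N_TOKENS, TEXT_DIM) word embeddings of "a photo of a",
    assumed to be precomputed from the frozen CLIP token-embedding table."""
    # First-layer text prompts start from "a photo of a"; deeper layers are random.
    text_prompts = nn.ParameterList(
        [nn.Parameter(phrase_embedding.clone())]
        + [nn.Parameter(torch.randn(N_TOKENS, TEXT_DIM) * 0.02)
           for _ in range(PROMPT_DEPTH - 1)]
    )
    # Vision prompts are randomly initialized from a normal distribution at every depth.
    vision_prompts = nn.ParameterList(
        [nn.Parameter(torch.randn(N_TOKENS, VISION_DIM) * 0.02)
         for _ in range(PROMPT_DEPTH)]
    )
    return text_prompts, vision_prompts
```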
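Finally, the quoted optimization settings (SGD, learning rate 0.0025, batch size 4, 20 or 50 epochs, λ1 = 1, λ2 = 25, λ3 = 1 in Eq. 9) could be wired up roughly as below. The names of the three loss terms and their mapping to the λ weights are assumptions; Eq. 9 in the paper defines the actual terms.

```python
# Hedged sketch of the training configuration reported in the Experiment Setup row.
import torch

LR, BATCH_SIZE = 0.0025, 4
EPOCHS_BASE_TO_NOVEL, EPOCHS_FEW_SHOT = 20, 50
LAMBDA1, LAMBDA2, LAMBDA3 = 1.0, 25.0, 1.0

def total_loss(loss_cls, loss_vision_collab, loss_language_collab):
    """Weighted sum standing in for Eq. 9; which λ weights which term is an assumption here."""
    return LAMBDA1 * loss_cls + LAMBDA2 * loss_vision_collab + LAMBDA3 * loss_language_collab

def make_optimizer(prompt_parameters):
    # Only the learnable prompt tokens are trained; the CLIP backbone stays frozen.
    return torch.optim.SGD(prompt_parameters, lr=LR)
```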