AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
Authors: Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, Limin Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales. |
| Researcher Affiliation | Academia | Yuhan Zhu¹, Yuyang Ji¹, Zhiyu Zhao¹·², Gangshan Wu¹, Limin Wang¹·² — ¹State Key Laboratory for Novel Software Technology, Nanjing University; ²Shanghai AI Laboratory |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/MCG-NJU/AWT |
| Open Datasets | Yes | Our study encompasses 18 datasets that span a wide array of recognition tasks: ImageNet [89], Caltech101 [95] and Caltech256 [101] for generic object recognition, Oxford Pets [92], Stanford Cars [93], Oxford Flowers [90], Food101 [96], FGVCAircraft [98], Birdsnap [100] and CUB [102] for fine-grained classification, SUN397 [97] for scene recognition, DTD [91] for texture classification, EuroSAT [99] for satellite recognition, and UCF101 [94] for action recognition. Besides, four ImageNet variant datasets are involved to assess the model's capability for OOD generalization: ImageNet-A [103], ImageNet-V2 [104], ImageNet-R [105] and ImageNet-Sketch [106]. |
| Dataset Splits | No | The paper discusses training and testing for few-shot learning (e.g., "We trained our model using 1, 2, 4, 8, and 16 shots") and provides training hyperparameters in Table 12, but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | All experiments are conducted on one NVIDIA A100-SXM4-80GB GPU. |
| Software Dependencies | No | The paper mentions using the "CLIP-B/16 model [1]", "GPT-3.5 [35]", and "Sinkhorn's Algorithm [69]" but does not provide specific version numbers for these software dependencies or libraries. |
| Experiment Setup | Yes | We implemented the AWT framework using the CLIP-B/16 model [1]. Image augmentations include random resized cropping and flipping, and class descriptions are generated via GPT-3.5 [35]. We set the number of augmented images N and descriptions M to 50 each. Dataset-level descriptions are provided in Appendix C. For both visual and textual modalities, we configured the importance distribution temperatures at γv = 1/2 and γt = 1/2. The optimal transport problem is approximated using Sinkhorn's Algorithm with an ϵ of 0.1 [69]. (...) Detailed settings and hyperparameters for our few-shot learning experiments are outlined in Tab. 12. |
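
To make the Experiment Setup row above more concrete, the following is a minimal sketch (not the authors' released code) of the transport step it describes: N augmented image features are matched to M class-description features with entropic optimal transport solved by Sinkhorn's algorithm at ϵ = 0.1, using softmax "importance" distributions with temperature 1/2 as the marginals. The function names, tensor shapes, and the way raw relevance scores are obtained are illustrative assumptions, not the repository's API.

```python
# Hedged sketch of an AWT-style OT matching step; shapes and helpers are assumptions.
import torch

def importance(scores: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Turn raw relevance scores into an importance distribution (temperature gamma)."""
    return torch.softmax(scores / gamma, dim=-1)

def sinkhorn(cost: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
             eps: float = 0.1, n_iters: int = 100) -> torch.Tensor:
    """Entropic-regularized OT plan between marginals a (N,) and b (M,)."""
    K = torch.exp(-cost / eps)                   # Gibbs kernel, shape (N, M)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        u = a / (K @ (b / (K.t() @ u)))          # alternating scaling updates
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # transport plan, shape (N, M)

def class_score(image_feats: torch.Tensor, text_feats: torch.Tensor,
                img_scores: torch.Tensor, txt_scores: torch.Tensor) -> torch.Tensor:
    """OT-weighted similarity between one image's N augmented views (N, D)
    and one class's M description embeddings (M, D); features assumed L2-normalized."""
    a = importance(img_scores)                   # (N,) weights over augmented views
    b = importance(txt_scores)                   # (M,) weights over descriptions
    sim = image_feats @ text_feats.t()           # cosine similarities, (N, M)
    plan = sinkhorn(1.0 - sim, a, b, eps=0.1)    # cost = 1 - cosine similarity
    return (plan * sim).sum()                    # transport-weighted class score
```

In use, this score would be computed once per candidate class and the image assigned to the class with the highest value; the paper's actual pipeline (augmentation choices, GPT-3.5 descriptions, weighting details) is specified in its Section 3 and Appendix C.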