AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Authors: Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, Limin Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
Researcher Affiliation | Academia | Yuhan Zhu (1), Yuyang Ji (1), Zhiyu Zhao (1,2), Gangshan Wu (1), Limin Wang (1,2); (1) State Key Laboratory for Novel Software Technology, Nanjing University; (2) Shanghai AI Laboratory
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/MCG-NJU/AWT
Open Datasets | Yes | Our study encompasses 18 datasets that span a wide array of recognition tasks: ImageNet [89], Caltech101 [95] and Caltech256 [101] for generic object recognition, Oxford Pets [92], Stanford Cars [93], Oxford Flowers [90], Food101 [96], FGVCAircraft [98], Birdsnap [100] and CUB [102] for fine-grained classification, SUN397 [97] for scene recognition, DTD [91] for texture classification, EuroSAT [99] for satellite recognition, and UCF101 [94] for action recognition. Besides, four ImageNet variant datasets are involved to assess the model's capability for OOD generalization: ImageNet-A [103], ImageNetV2 [104], ImageNet-R [105] and ImageNet-Sketch [106].
Dataset Splits | No | The paper discusses training and testing for few-shot learning (e.g., "We trained our model using 1, 2, 4, 8, and 16 shots") and provides training hyperparameters in Table 12, but does not explicitly mention a separate validation dataset split.
Hardware Specification | Yes | All experiments are conducted on one NVIDIA A100-SXM4-80GB GPU.
Software Dependencies | No | The paper mentions using "CLIP-B/16 model [1]", "GPT-3.5 [35]", and "Sinkhorn's Algorithm [69]" but does not provide specific version numbers for these software dependencies or libraries.
Experiment Setup | Yes | We implemented the AWT framework using the CLIP-B/16 model [1]. Image augmentations include random resized cropping and flipping, and class descriptions are generated via GPT-3.5 [35]. We set the number of augmented images N and descriptions M to 50 each. Dataset-level descriptions are provided in Appendix C. For both visual and textual modalities, we configured the importance distribution temperatures at γv = 1/2 and γt = 1/2. The optimal transport problem is approximated using Sinkhorn's Algorithm with an ϵ of 0.1 [69]. (...) Detailed settings and hyperparameters for our few-shot learning experiments are outlined in Tab. 12.
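
To make the reported transport step concrete, the sketch below shows one way the quoted hyperparameters (N = M = 50, γv = γt = 1/2, Sinkhorn with ϵ = 0.1) could fit together. It is a minimal illustration, not the authors' released code (see the repository linked above): the function names sinkhorn and awt_distance, and the rule that derives the importance marginals from mean cross-modal similarity, are assumptions made here for clarity and may differ from the paper's exact weighting scheme.

```python
import torch
import torch.nn.functional as F


def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """Entropic-regularized OT plan between marginals a (N,) and b (M,)."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel, shape (N, M)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)                    # scale columns to match b
        u = a / (K @ v)                        # scale rows to match a
    return u[:, None] * K * v[None, :]         # transport plan P, shape (N, M)


def awt_distance(img_feats, txt_feats, gamma_v=0.5, gamma_t=0.5, eps=0.1):
    """OT distance between N augmented-view features and M description features."""
    img_feats = F.normalize(img_feats, dim=-1)  # (N, D), e.g. N = 50 augmented views
    txt_feats = F.normalize(txt_feats, dim=-1)  # (M, D), e.g. M = 50 descriptions
    sim = img_feats @ txt_feats.t()             # cosine similarities, (N, M)
    cost = 1.0 - sim                            # OT cost matrix
    # Importance marginals via temperature-scaled softmax (gamma_v = gamma_t = 1/2
    # as reported); scoring each view/description by its mean cross-modal
    # similarity is a simplifying assumption of this sketch.
    a = torch.softmax(sim.mean(dim=1) / gamma_v, dim=0)
    b = torch.softmax(sim.mean(dim=0) / gamma_t, dim=0)
    plan = sinkhorn(cost, a, b, eps=eps)        # approximated with Sinkhorn, eps = 0.1
    return (plan * cost).sum()                  # smaller distance -> better class match
```

In such a setup, the N augmented views would come from standard random resized cropping and flipping (e.g., torchvision's RandomResizedCrop and RandomHorizontalFlip), and a class could be scored by the negative of this distance, with the lowest-distance class taken as the prediction.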