Tuning Multi-mode Token-level Prompt Alignment across Modalities

Authors: Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, Hanwang Zhang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens have the ability to capture diverse visual concepts.
Researcher Affiliation | Academia | Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, School of Electronic Engineering, Xidian University, Xi'an, China, {wds,limiaoge,xinyangatk,msxu}@stu.xidian.edu.cn, bchen@mail.xidian.edu.cn; Hanwang Zhang, School of Computer Science and Engineering, Nanyang Technological University, Singapore, hanwangzhang@ntu.edu.sg
Pseudocode | Yes | We describe our proposed model in the Appendix Algorithm 1. Algorithm 1: Training algorithm of ALIGN.
Open Source Code | Yes | The code is available at https://github.com/wds2014/ALIGN.
Open Datasets | Yes | Datasets: To make a comprehensive evaluation, we performed extensive experiments on 4 task settings: few-shot image recognition, base-to-new generalization, cross-dataset transfer learning, and domain generalization. Those experiments are conducted on 15 widely used image datasets, varying in scale and domain, including ImageNet [44], Caltech101 [45], OxfordPets [46], StanfordCars [47], Flowers102 [48], Food101 [49], FGVCAircraft [50], EuroSAT [51], UCF101 [52], DTD [53], SUN397 [54], ImageNetV2 [55], ImageNet-Sketch [56], ImageNet-A [57], and ImageNet-R [58]. The details of each dataset are provided in the Appendix Table B.1.
Dataset Splits | Yes | Table B.1: Statistics of the used 15 datasets. N/A denotes that we do not use the corresponding training or validation sets. Dataset ... #Train #Val #Test ... Caltech101 ... 4,128 1,649 2,465
Hardware Specification | No | The paper mentions general terms like "GPU" in the complexity analysis but does not specify any particular GPU models, CPU models, or other hardware used for running the experiments. For example, Table 4 reports "fps" (frames per second), a performance metric, but not the hardware on which it was measured beyond a general "GPU" in the surrounding text.
Software Dependencies | No | The paper mentions the "pre-trained ViT-B/16 CLIP model as our backbone" and "fp16" precision in Table B.2, but it does not list specific software dependencies with version numbers, such as Python, PyTorch, TensorFlow, or other libraries.
Experiment Setup | Yes | Implementation Details: Following the previous MaPLe [28], we load the pre-trained ViT-B/16 CLIP model as our backbone, where dl = 512, dv = 768, and d = 512. We set the number of textual and visual prompts M = N = 4, the length of prompt tokens b = 2, the hyperparameter λ = 0.1, and β = 1. The maximum iteration number in the Sinkhorn algorithm is set to 100. For all tasks, we train our model with a batch size of 4, a learning rate of 0.0035, and SGD as the optimizer. For each task, we optimize the number of epochs. Following MaPLe, we run 2 epochs to train ImageNet as a source model with a learning rate of 0.0026. The reported results are averaged over 3 seeds. Please refer to the Appendix Sec. B for more details. For all baselines, we set the length of prompts to 4 and collect their results from the original papers or previous works. Thus, some experimental results may be missing.
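
For reproduction attempts, the implementation details quoted above can be collected into a single configuration block. The sketch below is assembled only from the numbers reported in that row and is not the authors' code from https://github.com/wds2014/ALIGN; the names AlignConfig and sinkhorn, and the eps regularization value, are hypothetical, and the Sinkhorn routine is a generic entropic optimal-transport iteration standing in for whatever variant the paper's Algorithm 1 actually uses.

    # Hypothetical reproduction sketch based on the reported settings; not the
    # authors' implementation.
    from dataclasses import dataclass

    import torch


    @dataclass
    class AlignConfig:
        backbone: str = "ViT-B/16"   # pre-trained CLIP backbone
        d_text: int = 512            # dl, textual embedding dimension
        d_visual: int = 768          # dv, visual embedding dimension
        d_shared: int = 512          # d, shared projection dimension
        num_text_prompts: int = 4    # M
        num_visual_prompts: int = 4  # N
        prompt_length: int = 2       # b, tokens per prompt
        lam: float = 0.1             # λ, loss-weight hyperparameter
        beta: float = 1.0            # β
        sinkhorn_max_iters: int = 100
        batch_size: int = 4
        lr: float = 0.0035           # 0.0026 when training the ImageNet source model
        optimizer: str = "SGD"
        num_seeds: int = 3           # reported results are averaged over 3 seeds


    def sinkhorn(cost: torch.Tensor, eps: float = 0.1, max_iters: int = 100,
                 tol: float = 1e-6) -> torch.Tensor:
        """Generic Sinkhorn iteration for an entropic optimal-transport plan.

        cost is an (m, n) pairwise cost matrix, e.g. between textual and visual
        prompt tokens; both marginals are taken as uniform. The eps value is an
        assumption -- the paper only fixes the maximum number of iterations.
        """
        m, n = cost.shape
        mu = cost.new_full((m,), 1.0 / m)       # uniform source marginal
        nu = cost.new_full((n,), 1.0 / n)       # uniform target marginal
        K = torch.exp(-cost / eps)              # Gibbs kernel
        u = torch.ones_like(mu)
        for _ in range(max_iters):
            u_prev = u
            v = nu / (K.t() @ u)                # column scaling
            u = mu / (K @ v)                    # row scaling
            if (u - u_prev).abs().max() < tol:  # early stop on convergence
                break
        return u[:, None] * K * v[None, :]      # transport plan diag(u) K diag(v)

With a cost matrix such as one minus the cosine similarity between textual and visual prompt-token embeddings, the returned plan weights token-to-token correspondences; how ALIGN builds the cost and aggregates the plan into its training loss is specified in the paper's Algorithm 1 rather than here.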