LAMM: Label Alignment for Multi-Modal Prompt Learning

Authors: Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang, Zefang Yu, Ke Ji, Mingye Xie, Ting Liu, Yuzhuo Fu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, exhibiting an average accuracy improvement of 2.31% compared to the state-of-the-art methods on 16 shots.
Researcher Affiliation | Academia | (1) School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China; (2) School of Biomedical Engineering, Shanghai Jiao Tong University, China; (3) School of Computer Science and Engineering, Southeast University, China
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.
Open Datasets | Yes | We follow the datasets used in previous works (Zhou et al. 2022b; Khattak et al. 2022) and evaluate our method on 11 image classification datasets, including Caltech101 (Fei-Fei, Fergus, and Perona 2007), ImageNet (Deng et al. 2009), Oxford Pets (Parkhi et al. 2012), Cars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Gool 2014), FGVC (Maji et al. 2013), SUN397 (Xiao et al. 2010), UCF101 (Soomro, Zamir, and Shah 2012), DTD (Cimpoi et al. 2014) and EuroSAT (Helber et al. 2019).
Dataset Splits | No | Specifically, we use 1, 2, 4, 8, and 16 shots for training respectively, and evaluate the models on the full test sets. All experimental results are the average of the results obtained from running the experiments on seeds 1, 2 and 3. The paper specifies the training shots and evaluation on the full test sets, but it does not explicitly mention a separate validation split or how hyperparameters were tuned in its absence.
Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A100.
Software Dependencies | No | The paper mentions models such as CLIP, CoOp, and MaPLe and states that training parameters were kept the same as in their original settings, but it does not specify software dependencies such as the programming language or library versions (e.g., Python version, PyTorch version).
Experiment Setup | Yes | We keep the same training parameters (e.g., learning rate, epochs, and other prompt parameters) of each model in their original settings, where the epoch of CoOp is 50 and MaPLe is 5. As for vanilla CLIP + LAMM, we follow the settings of CoOp. All of our experiments are conducted on a single NVIDIA A100. The corresponding hyper-parameters are fixed across all datasets in our work. A sketch of this training-and-evaluation protocol is given below the table.
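
For concreteness, the protocol summarized in the Dataset Splits and Experiment Setup rows (1/2/4/8/16 shots, seeds 1-3, full-test-set evaluation, per-method epoch budgets) can be written as a short loop. The sketch below is a minimal illustration under stated assumptions, not the authors' released code (that lives at https://github.com/gaojingsheng/LAMM): train_and_eval is a hypothetical callable standing in for a complete training run, and the dataset identifiers simply mirror the names in the table.

from statistics import mean

# Grid described in the paper: 11 datasets, 1/2/4/8/16 shots, seeds 1-3.
DATASETS = ["Caltech101", "ImageNet", "Oxford Pets", "Cars", "Flowers102", "Food101",
            "FGVC", "SUN397", "UCF101", "DTD", "EuroSAT"]
SHOTS = [1, 2, 4, 8, 16]        # labelled training examples per class
SEEDS = [1, 2, 3]               # results are averaged over these seeds
EPOCHS = {"CoOp": 50, "MaPLe": 5, "CLIP": 50}  # vanilla CLIP + LAMM follows CoOp's settings

def run_protocol(method, train_and_eval):
    """Return seed-averaged full-test-set accuracy for every (dataset, shots) pair."""
    results = {}
    for dataset in DATASETS:
        for shots in SHOTS:
            # One full training run per seed (hypothetical helper), then average.
            accs = [train_and_eval(method, dataset, shots, seed, EPOCHS[method])
                    for seed in SEEDS]
            results[(dataset, shots)] = mean(accs)
    return results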