COMMA: Co-articulated Multi-Modal Learning

Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method across three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost upon all tasks with high efficiency.
Researcher Affiliation | Academia | 1) College of Intelligence and Computing, Tianjin University, China; 2) Department of Computer and Information Science, University of Macau, China
Pseudocode | No | The paper describes the method using mathematical equations and text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/hulianyuyy/COMMA
Open Datasets | Yes | For base-to-novel generalization and cross-dataset evaluation, we follow previous methods (Khattak et al. 2023; Yao, Zhang, and Xu 2023) to evaluate the performance of our method on 11 image classification datasets, including two generic-objects datasets, ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004); five fine-grained datasets, OxfordPets (Parkhi et al. 2012), StanfordCars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014), and FGVCAircraft (Maji et al. 2013); a scene recognition dataset SUN397 (Xiao et al. 2010); an action recognition dataset UCF101 (Soomro, Zamir, and Shah 2012); a texture dataset DTD (Cimpoi et al. 2014); and a satellite-image dataset EuroSAT (Helber et al. 2019).
Dataset Splits | Yes | Base-to-Novel Generalization: the classes of each dataset are split into base and novel classes. The model is trained on the base classes in a few-shot setting and evaluated on both base and novel classes, with the novel classes tested in a zero-shot manner (see the protocol sketch after the table).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU, CPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a 'pretrained ViT-B/16 CLIP model' and the 'SGD optimizer' but does not specify version numbers for any software libraries or dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For all experiments, we use the pretrained ViT-B/16 CLIP model by default with d_l = 512 and d_v = 768. We use a 16-shot training strategy in all experiments by default, which randomly samples 16 shots for each class. Following previous methods (Khattak et al. 2023), we set the prompt depth J to 9 and the language and vision prompt lengths to 2. We train our models for 5 epochs with a batch size of 4 and a learning rate of 0.0035 with the SGD optimizer (see the configuration sketch after the table).
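
The base-to-novel protocol noted in the Dataset Splits row can be summarized in code. The sketch below is a hypothetical illustration of splitting a class list in half and sampling 16 shots per base class; the function names, the half-and-half split order, and the sampling details are assumptions for illustration, not code from the COMMA repository.

# Illustrative sketch of the base-to-novel, 16-shot protocol described above.
# All names here are hypothetical; only the 16-shot value comes from the paper.
import random
from collections import defaultdict

def base_novel_split(class_names):
    # Assumed convention: first half of the class list = base, second half = novel.
    mid = len(class_names) // 2
    return class_names[:mid], class_names[mid:]

def sample_few_shot(samples, base_classes, shots=16, seed=0):
    # Randomly keep `shots` training images per base class.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_path, label in samples:
        if label in base_classes:
            by_class[label].append((image_path, label))
    few_shot = []
    for label, items in by_class.items():
        few_shot.extend(rng.sample(items, min(shots, len(items))))
    return few_shot

The model would then be trained on the few-shot subset of base classes and evaluated separately on the held-out base-class and novel-class test sets.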
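
The hyperparameters quoted in the Experiment Setup row map onto a simple training configuration. The following is a minimal sketch assuming a PyTorch setup; only the numeric values (backbone, d_l, d_v, prompt depth, prompt length, shots, epochs, batch size, learning rate, SGD) come from the paper, while the dictionary keys, the build_optimizer helper, and the prompt_params argument are illustrative placeholders.

# Hypothetical configuration mirroring the reported setup; not the authors' code.
import torch

config = {
    "backbone": "ViT-B/16",   # pretrained CLIP model
    "text_dim": 512,          # d_l, language branch width
    "vision_dim": 768,        # d_v, vision branch width
    "prompt_depth": 9,        # J, number of layers equipped with prompts
    "prompt_length": 2,       # language and vision prompt lengths
    "shots": 16,              # 16-shot training per class
    "epochs": 5,
    "batch_size": 4,
    "lr": 0.0035,
}

def build_optimizer(prompt_params):
    # SGD over the learnable prompt parameters, as stated in the setup.
    return torch.optim.SGD(prompt_params, lr=config["lr"])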