Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

Authors: Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs.
Researcher Affiliation | Collaboration | 1 MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 3 MEGVII Technology; 4 CAIR, HKISI, Chinese Academy of Sciences, Hong Kong, China. {tanhao2023, lijun2021, jun.wan, zhen.lei}@ia.ac.cn, {zhouyizhuang, zhangxiangyu}@megvii.com
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | The code is available at https://github.com/EricTan7/TGP-T.
Open Datasets | Yes | Following CLIP (Radford et al. 2021), we adopt 11 publicly available image classification datasets that cover diverse scenes and scales, including ImageNet (Deng et al. 2009), Caltech (Fei-Fei, Fergus, and Perona 2004), Oxford Pets (Parkhi et al. 2012), Flowers (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014), Stanford Cars (Krause et al. 2013), FGVCAircraft (Maji et al. 2013), EuroSAT (Helber et al. 2019), UCF101 (Soomro, Zamir, and Shah 2012), DTD (Cimpoi et al. 2014), and SUN397 (Xiao et al. 2010).
Dataset Splits | Yes | We follow the few-shot evaluation protocol in CoOp (Zhou et al. 2022b), i.e., we use 1, 2, 4, 8, and 16 shots for training, respectively, and report results on the full test sets. ... We tune the hyperparameters on a few-shot validation set with min(n, 4) shots (n is the number of training shots) rather than searching on the test set.
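The split protocol quoted above can be sketched in a few lines. This is a minimal illustration, not code from the TGP-T repository; the function name and data layout are hypothetical, and it assumes each class has enough images to draw both splits without overlap.

```python
import random

def few_shot_split(samples_by_class, n_shots, seed=0):
    """CoOp-style few-shot split: sample n_shots training images and
    min(n_shots, 4) validation images per class, without overlap.
    The official test set is left untouched and used in full."""
    rng = random.Random(seed)
    n_val = min(n_shots, 4)
    train, val = {}, {}
    for cls, samples in samples_by_class.items():
        picked = rng.sample(samples, n_shots + n_val)
        train[cls] = picked[:n_shots]
        val[cls] = picked[n_shots:]
    return train, val
```

For n = 1 or 2 this yields validation sets of the same size as the training sets; from n = 4 upward the validation set is capped at 4 shots per class.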
Hardware Specification | Yes | Note that when using batch size of 8, CoCoOp runs into out-of-memory (OOM) problems on Stanford Cars, SUN397, and ImageNet with Nvidia RTX 3090. ... Moreover, TGP-T enables the utilization of more powerful backbones such as ViT-L/14, while CoOp, CoCoOp, and MaPLe run into out-of-memory (OOM) problems on Nvidia RTX 3090.
Software Dependencies | No | The paper mentions software like the 'AdamW optimizer' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We set ViT-B/16 as the image encoder. The depth of the Bonder is set to 1. The number of category-wise and content-wise prompt queries is 32 and 64, respectively. We adopt the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 5e-5 and a weight decay of 1e-4. The model is trained for 12,800 iterations with a batch size of 8.
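The iteration-based schedule above can be related to epochs with simple arithmetic. This is a back-of-the-envelope sketch, not from the paper: the helper name is hypothetical, and only the 16-shot ImageNet case (1,000 classes x 16 shots) is computed as an example.

```python
# Reported schedule: 12,800 iterations at batch size 8.
ITERATIONS = 12_800
BATCH_SIZE = 8

def equivalent_epochs(num_train_images):
    """Number of passes over the training set implied by a fixed
    iteration budget (iterations * batch size / dataset size)."""
    return ITERATIONS * BATCH_SIZE / num_train_images

# 16-shot ImageNet: 1,000 classes * 16 shots = 16,000 training images.
print(round(equivalent_epochs(1000 * 16), 1))  # 6.4
```

Because the budget is fixed in iterations, smaller shot counts see proportionally more epochs (e.g. the 1-shot ImageNet split of 1,000 images is revisited 102.4 times).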