CALIP: Zero-Shot Enhancement of CLIP with Parameter-Free Attention

Authors: Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzheng Ma, Xupeng Miao, Xuming He, Bin Cui

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CALIP on various benchmarks of 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. Based on that, we further insert a small number of linear layers in CALIP's attention module and verify our robustness under the few-shot settings, which also achieves leading performance compared to existing methods. Those extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP.
Researcher Affiliation | Academia | 1 School of CS and Key Lab of HCST, Peking University; 2 The Chinese University of Hong Kong; 3 Shanghai AI Laboratory; 4 ShanghaiTech University; 5 Carnegie Mellon University
Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Code is available at https://github.com/ZiyuGuo99/CALIP.
Open Datasets | Yes | 2D datasets contain a variety of visual concepts, e.g., real-world scenarios, satellite-captured landscapes and detailed textures, which are ImageNet (Jia et al. 2009), Caltech101 (Li, Fergus, and Perona 2004), OxfordPets (Vedaldi 2012), StanfordCars (Krause et al. 2014), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Gool 2014), FGVCAircraft (Maji et al. 2013), SUN397 (Xiao et al. 2010), DTD (Cimpoi et al. 2013), EuroSAT (Helber et al. 2017) and UCF101 (Soomro, Zamir, and Shah 2012). The 3D datasets include both synthetic and sensor-scanned point clouds: ModelNet10 (Wu et al. 2015), ModelNet40 (Wu et al. 2015) and ScanObjectNN (Uy et al. 2019).
Dataset Splits | Yes | We report our results on the official validation set for tuning hyperparameters and network structures.
Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments, such as GPU/CPU models or types of computing resources.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | Following CLIP's (Radford et al. 2021) pre-processing, we resize all test images into 224×224 resolution, and H, W, C of the visual spatial feature F_s denote 7, 7, 1024. We set α_t and α_s for modulating textual and visual attention magnitude both as 2. For the pooling operation of F_s^a, we select the combination of maximum and average poolings for better feature integration. We adopt varying β_1, β_2, β_3 for different datasets to adapt to their specific domains. As for textual templates, we refer to CLIP adopting handcrafted ones.
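To make the quoted experiment setup more concrete, below is a minimal, hypothetical PyTorch sketch of a parameter-free cross-attention between CLIP textual and visual spatial features, using the values quoted above: a 7×7×1024 spatial feature F_s, modulation factors α_t = α_s = 2, and a combination of maximum and average pooling over the attended map F_s^a. This is not the authors' implementation; the function names, the direction in which the α factors scale the attention logits, and the sum used to combine the two poolings are assumptions.

```python
# Hypothetical sketch of a parameter-free attention step consistent with the
# quoted setup; NOT the authors' code. Fs is the CLIP visual spatial feature
# (H*W x C with H = W = 7, C = 1024) and Ft holds the textual class features.
import torch
import torch.nn.functional as F


def parameter_free_attention(Fs, Ft, alpha_s=2.0, alpha_t=2.0):
    """Cross-attend visual and textual features without any learned weights."""
    Fs = F.normalize(Fs, dim=-1)          # (HW, C)
    Ft = F.normalize(Ft, dim=-1)          # (K, C), one row per class prompt
    A = Fs @ Ft.t()                       # (HW, K) similarity map, no parameters
    # alpha_s / alpha_t modulate the attention magnitude (scaling direction assumed).
    Fs_a = torch.softmax(alpha_s * A, dim=-1) @ Ft      # attended visual map (HW, C)
    Ft_a = torch.softmax(alpha_t * A.t(), dim=-1) @ Fs  # attended textual feats (K, C)
    return Fs_a, Ft_a


def pool_attended(Fs_a):
    """Combine maximum and average pooling over the spatial dimension (sum assumed)."""
    return Fs_a.max(dim=0).values + Fs_a.mean(dim=0)    # (C,)


# Toy usage with the shapes quoted in the setup: 7x7 grid, 1024 channels, 10 classes.
Fs = torch.randn(7 * 7, 1024)
Ft = torch.randn(10, 1024)
Fs_a, Ft_a = parameter_free_attention(Fs, Ft)
visual_global = pool_attended(Fs_a)
```

The dataset-specific β_1, β_2, β_3 mentioned in the quoted setup weight how the attended and original CLIP features contribute to the final classification scores; that blending step is omitted from this sketch.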