S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

Authors: Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman Khan, Kun Zhang, Fahad Shahbaz Khan

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLM-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/shengeatamath/S3A.
Researcher Affiliation Academia Mohamed bin Zayed University of Artificial Intelligence Carnegie Mellon University Australian National University Linköping University
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks. Figure 3 illustrates the S3A algorithm, but it is a diagram, not pseudocode.
Open Source Code Yes Our codes, models, and prompts are publicly released at https://github.com/shengeatamath/S3A.
Open Datasets Yes We evaluate S3A on two generic and five fine-grained benchmarks, i.e., the generic benchmarks of sampled ImageNet-100 (IN100) and ImageNet-1K (IN1K) (Deng et al. 2009), and fine-grained benchmarks of Stanford Dogs-120 (SDogs) (Khosla et al. 2011), Living17-68 (LV17), Nonliving26-104 (NL26), Entity13-260 (ET13), and Entity30-240 (ET30) in BREEDS (Santurkar, Tsipras, and Madry 2020).
Dataset Splits No The paper lists datasets used for evaluation and mentions training iterations/epochs, but it does not provide explicit train/validation/test dataset splits (e.g., percentages, sample counts, or specific citations to predefined splits with full details) needed for reproduction.
Hardware Specification Yes Experiments are conducted on a single A6000 GPU.
Software Dependencies No The paper states 'Our data augmentations and optimizer follow MUST (Li, Savarese, and Hoi 2022)' but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch version, Python version, CUDA version).
Experiment Setup Yes In our method, we fix m = 3 and γ = 0.25 on all datasets. Considering efficiency, we only compute prompting at the first epoch. We adopt ViT-B/16 (Dosovitskiy et al. 2020) as our CLIP backbone. Our data augmentations and optimizer follow MUST (Li, Savarese, and Hoi 2022). We train on all datasets for up to 30K iterations, with 60 epochs for Pet and 30 epochs for other datasets. Besides, we linearly warm up the EMA decay parameter to 0.9998 within specified iterations. We set the initial EMA weight decay of Pet and other datasets as 0.99 and 0.999, respectively. The warmup iterations are 500 for CIFAR, 100 for Pet, and 2000 for other datasets. The thresholds τ are 0.3 for CIFAR, 0.7 for Pet, and 0.5 for other datasets.
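The EMA schedule quoted above (linearly warming the decay from an initial value of 0.999 to 0.9998 over a dataset-specific number of warmup iterations) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names and the exact interpolation rule are assumptions consistent with the quoted hyperparameters.

```python
def ema_decay_at(step, warmup_iters, init_decay=0.999, target_decay=0.9998):
    """Linearly warm up the EMA decay from init_decay to target_decay.

    After warmup_iters steps, the decay stays at target_decay.
    Example values from the paper: warmup_iters = 500 (CIFAR),
    100 (Pet, with init_decay = 0.99), 2000 (other datasets).
    """
    if step >= warmup_iters:
        return target_decay
    frac = step / warmup_iters
    return init_decay + frac * (target_decay - init_decay)


def ema_update(ema_params, model_params, decay):
    """Standard EMA update: ema <- decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]
```

With warmup_iters = 2000, the decay starts at 0.999, reaches 0.9994 at step 1000, and plateaus at 0.9998 from step 2000 onward.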