Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

Authors: Yicheng Liu, Jie Wen, Chengliang Liu, Xiaozhao Fang, Zuoyong Li, Yong Xu, Zheng Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that our method outperforms other zero-shot multi-label recognition methods and achieves competitive results compared to few-shot methods. Extensive experiment results show that, without using image data to train the model, our method still performs significantly better than many zero-shot and few-shot methods for MLR.
Researcher Affiliation | Academia | 1 Harbin Institute of Technology, Shenzhen, China; 2 Guangdong University of Technology, Guangzhou, China; 3 Minjiang University, Fuzhou, China.
Pseudocode | Yes | Algorithm 1: Training process of CoMC. Algorithm 2: Inference process of CoMC.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on MS-COCO (Lin et al., 2014), VOC2007 (Everingham et al., 2010), and NUS-WIDE (Chua et al., 2009) to evaluate the superiority of the proposed method on multi-label recognition tasks.
Dataset Splits | No | The paper specifies test splits for MS-COCO, VOC2007, and NUS-WIDE and mentions training images for NUS-WIDE, but it does not describe a validation split for any of the datasets, which would be needed to fully reproduce the data partitioning.
Hardware Specification | No | The paper does not specify any hardware details, such as GPU models, CPU types, or memory, used to run the experiments.
Software Dependencies | No | The paper mentions using "CLIP ResNet-50 as the image encoder and CLIP Transformer as the text encoder" and that the "GPT-3 DaVinci-002 (Brown et al., 2020) model is adopted" for text generation, but it does not list any specific software library versions (e.g., PyTorch version, Python version, or optimizer library versions) required for reproduction.
Experiment Setup | Yes | We use cosine learning rate decay with an initial learning rate of 1e-4. We train our classifier using the Adam optimizer with a batch size of 256 for 30 epochs. For inference, input images are resized to 224×224. The weighting factor α is set to 0.7, 0.4, and 0.5 on MS-COCO, VOC2007, and NUS-WIDE, respectively. The temperature parameter τ is set to 1/100.
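The hyperparameters reported above can be collected into a short configuration sketch. This is not the authors' code (which is unreleased, per the Open Source Code row); it is a minimal illustration assuming a standard cosine-annealing schedule that decays the learning rate to zero over the 30 training epochs, with the per-dataset α values and temperature τ from the paper.

```python
import math

# Reported hyperparameters (from the Experiment Setup row).
INIT_LR = 1e-4        # initial learning rate for Adam
EPOCHS = 30           # total training epochs
BATCH_SIZE = 256      # training batch size
IMAGE_SIZE = 224      # inference images resized to 224x224
TAU = 1 / 100         # temperature parameter τ
# Weighting factor α per dataset.
ALPHA = {"MS-COCO": 0.7, "VOC2007": 0.4, "NUS-WIDE": 0.5}

def cosine_lr(epoch: int) -> float:
    """Cosine-decayed learning rate at a given epoch (assumes decay to 0)."""
    return INIT_LR * 0.5 * (1 + math.cos(math.pi * epoch / EPOCHS))
```

With this schedule, the rate starts at 1e-4, reaches half that value at the midpoint (epoch 15), and decays to zero by epoch 30.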