Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
Authors: Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Xiujun Shu, Bo Ren, Shu-Tao Xia
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that our MKT method significantly outperforms the previous ML-ZSL methods and establishes a new state of the art for open-vocabulary multi-label classification on two large-scale benchmark datasets, namely NUS-WIDE and Open Images. In this experiment, we compare our model with traditional ML-ZSL methods. To study the impacts of knowledge distillation and prompt tuning, we conduct experiments with different training schemes and illustrate the results in Table 2. To demonstrate the effectiveness of our proposed two-stream module, we conduct ablation studies of both local and global heads. Table 4 shows the results in terms of mAP and F1 score on NUS-WIDE. |
| Researcher Affiliation | Collaboration | Sunan He 1,2,3*, Taian Guo 3*, Tao Dai 1, Ruizhi Qiao 3, Xiujun Shu 3, Bo Ren 3, Shu-Tao Xia 2,4. 1 College of Computer Science and Software Engineering, Shenzhen University; 2 Tsinghua Shenzhen International Graduate School, Tsinghua University; 3 YouTu Lab, Tencent; 4 Research Center of Artificial Intelligence, Peng Cheng Laboratory |
| Pseudocode | No | The paper describes the model architecture and equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for their methodology or a link to a code repository. |
| Open Datasets | Yes | In the NUS-WIDE dataset, there are 81 human verified labels, in addition to 925 labels based on Flickr user tags. Similar to LESA (Huynh and Elhamifar 2020), we treat 925 labels as seen labels and the other 81 labels as unseen labels. Following official train/test split, we utilize 161,789 images for training and 107,859 images for testing. The Open Images (v4) dataset is more challenging because it consists of 9M training images and 125,456 testing images. Similar to LESA, we treat 7,186 labels with more than 100 images in training set as seen and the most frequent 400 test labels that are not present in training data as unseen. |
| Dataset Splits | No | The paper specifies training and testing splits for NUS-WIDE and Open Images datasets. However, it does not explicitly provide details about a separate validation dataset split (e.g., percentages or counts) used to tune hyperparameters or for early stopping. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like 'ViT-B/16', 'CLIP', and 'AdamW optimizer' but does not specify their version numbers or other crucial software dependencies like programming language versions (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch, TensorFlow) with their versions. |
| Experiment Setup | Yes | In the first stage, we use the AdamW optimizer with a base learning rate of 0.001 and weight decay of 0.005. We adjust the base learning rate of the AdamW optimizer to 0.00003 during the second stage for fine-tuning the context embedding. On NUS-WIDE, we train the model for 20 epochs with a mini-batch size of 128 and 10 epochs with a mini-batch size of 16 in the first and second stage, respectively. Considering the large scale of Open Images, the model is trained for 4 epochs and 2 epochs in each stage with the same batch sizes as above. |
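The reported two-stage setup can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: `model` and `ctx_embedding` are hypothetical stand-ins for the MKT backbone/heads and the learnable prompt context, and the stage-2 weight decay is assumed to stay at 0.005 since the paper only states the changed learning rate.

```python
import torch
from torch import nn

# Hypothetical stand-ins: the paper fine-tunes a ViT-B/16-based model and a
# learnable prompt context embedding; shapes here are illustrative only.
model = nn.Linear(512, 81)                          # stand-in for the MKT model
ctx_embedding = nn.Parameter(torch.zeros(4, 512))   # stand-in prompt context

# Stage 1: AdamW, base lr 0.001, weight decay 0.005 (as reported).
stage1_opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)

# Stage 2: fine-tune the context embedding at base lr 0.00003
# (weight decay kept at 0.005 here; the paper does not restate it).
stage2_opt = torch.optim.AdamW([ctx_embedding], lr=3e-5, weight_decay=5e-3)

# Reported schedules: (epochs, batch size, optimizer) per stage.
SCHEDULE = {
    "nus_wide":    [(20, 128, stage1_opt), (10, 16, stage2_opt)],
    "open_images": [(4, 128, stage1_opt), (2, 16, stage2_opt)],
}
```

The schedule dictionary just pairs the reported epoch counts and batch sizes with the stage optimizers; an actual training loop would iterate over it per dataset.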