Delving into Multimodal Prompting for Fine-Grained Visual Classification
Authors: Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology, China; Tongji University, China; Singapore Management University, Singapore |
| Pseudocode | No | The paper describes methods textually and with diagrams (Figure 1), but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository or explicitly state that the code for the methodology is being released or is available. |
| Open Datasets | Yes | Dataset. CUB-200-2011 (Wah et al. 2011) comprises 11,788 bird images from 200 bird species... Stanford Dogs (Khosla et al. 2011) consists of 20,580 images... NABirds (Horn et al. 2015) contains 48,562 images... Food101 (Bossard, Guillaumin, and Gool 2014) encompasses 101 different kinds of foods, totaling 101,000 images. |
| Dataset Splits | Yes | CUB-200-2011 (Wah et al. 2011) comprises 11,788 bird images from 200 bird species, which are officially split into 5,994 training images and 5,794 test images. Stanford Dogs (Khosla et al. 2011) consists of 20,580 images depicting 120 dog variants, with 12,000 images allocated for training and 8,580 images designated for testing. NABirds (Horn et al. 2015) contains 48,562 images showcasing North American birds across 555 subcategories. It is split into 23,929 training images and 24,633 test images. Food101 (Bossard, Guillaumin, and Gool 2014) encompasses 101 different kinds of foods, totaling 101,000 images. Within each class, 250 images are for testing, while the remaining 750 images are for training. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU models, CPU models, or cloud computing instance types. It only discusses software and dataset details. |
| Software Dependencies | No | The paper states 'We employ the ViT-B16 (Dosovitskiy et al. 2021a) as our image encoder I(·). The text encoder T(·) is a pre-trained Transformer model from CLIP (Radford et al. 2021).' This mentions the models used but not specific software libraries such as PyTorch or TensorFlow with version numbers. |
| Experiment Setup | Yes | Implementation details. We employ the ViT-B16 (Dosovitskiy et al. 2021a) as our image encoder I(·). The text encoder T(·) is a pre-trained Transformer model from CLIP (Radford et al. 2021). All input images are resized to a resolution of 448 × 448. In the first training stage, we initialize the learning rate as 3e-2, except for Stanford Dogs where it is initialized as 3e-3. The number of training epochs is set to 30 for both CUB-200-2011 and Stanford Dogs, while the remaining datasets are trained for 10 epochs. For the second training stage, we initialize the learning rate as 1e-3, except for Stanford Dogs where it is initialized as 1e-4. The number of training epochs is set to 10 for both CUB-200-2011 and Stanford Dogs, while the other datasets are trained for 3 epochs. Both training stages utilize the SGD optimizer and employ cosine annealing as the optimization scheduler. |
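
The Experiment Setup row pins down the optimizer, scheduler, per-dataset learning rates, epoch counts, and the 448 × 448 input resolution, but not the surrounding training code. The PyTorch sketch below is a hypothetical reconstruction of that two-stage schedule only; the model, data loaders, SGD momentum, and loss function are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the two-stage training schedule quoted above.
# The MP-FGVC encoders and prompting modules are NOT reproduced here;
# `model` and `loader` are placeholders, and momentum and the loss are assumed.
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# All input images are resized to 448 x 448 (stated in the paper).
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

# Per-dataset learning rates and epoch counts quoted in the implementation details.
STAGE1 = {
    "CUB-200-2011":  {"lr": 3e-2, "epochs": 30},
    "Stanford Dogs": {"lr": 3e-3, "epochs": 30},
    "NABirds":       {"lr": 3e-2, "epochs": 10},
    "Food101":       {"lr": 3e-2, "epochs": 10},
}
STAGE2 = {
    "CUB-200-2011":  {"lr": 1e-3, "epochs": 10},
    "Stanford Dogs": {"lr": 1e-4, "epochs": 10},
    "NABirds":       {"lr": 1e-3, "epochs": 3},
    "Food101":       {"lr": 1e-3, "epochs": 3},
}

def run_stage(model: nn.Module, loader, cfg: dict, device: str = "cuda") -> None:
    """One training stage: SGD with cosine annealing, per the paper's description."""
    optimizer = SGD(model.parameters(), lr=cfg["lr"], momentum=0.9)  # momentum assumed
    scheduler = CosineAnnealingLR(optimizer, T_max=cfg["epochs"])
    criterion = nn.CrossEntropyLoss()  # loss function assumed for illustration
    model.train()
    for _ in range(cfg["epochs"]):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)  # placeholder forward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # cosine annealing stepped once per epoch (assumed)

# Usage (placeholders): run both stages on a chosen dataset.
# run_stage(model, train_loader, STAGE1["CUB-200-2011"])
# run_stage(model, train_loader, STAGE2["CUB-200-2011"])
```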