Why are Visually-Grounded Language Models Bad at Image Classification?

Authors: Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset. (A hedged sketch of this frequency-accuracy correlation analysis follows the table.)
Researcher Affiliation | Collaboration | 1Stanford University, 2University of Washington, 3Tsinghua University
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We provide an open-source implementation of our work and have released the ImageWikiQA dataset at https://github.com/yuhui-zh15/VLMClassifier.
Open Datasets | Yes | We evaluated the aforementioned models on four widely-used image classification benchmarks: ImageNet [11], Flowers102 [35], Stanford Cars [21], and Caltech101 [13], which contain 50,000, 6,149, 8,041, and 4,331 test images from 1,000, 102, 196, and 101 classes, respectively.
Dataset Splits | Yes | Table 7: Data details. ImageNet [11] (https://www.image-net.org/): 1.28M training / 50K validation images, 1,000 classes. Flowers102 [35] (https://www.tensorflow.org/datasets/catalog/oxford_flowers102): 2.0K training / 6.1K validation images, 102 classes. Stanford Cars [21] (https://www.tensorflow.org/datasets/catalog/cars196): 8.1K training / 8.0K validation images, 196 classes. Caltech101 [13] (https://www.tensorflow.org/datasets/catalog/caltech101): 4.3K training / 4.3K validation images, 101 classes. (A hedged loading sketch follows the table.)
Hardware Specification | Yes | We use four NVIDIA L40S GPUs for all experiments.
Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer' and 'LoRA' but does not provide specific version numbers for these or other key software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | LLaVA fine-tuning details. We convert each image and class label into the text format using the LLaVA default template USER: <576 Image Tokens> What type of object is in this photo? ASSISTANT: <Class Name>. We conduct two settings for fine-tuning LLaVA. In the first setting, we only fine-tune the MLP projector between CLIP and the language model (LM). The projector is trained on the training set with a batch size of 64 and a learning rate of 2e-5 using the AdamW optimizer for 50 epochs (1 epoch for ImageNet), with a warmup ratio of 0.03. In the second setting, we fine-tune both the MLP projector and the LM using LoRA. The projector and LM are trained on the training set with a batch size of 64, a learning rate of 2e-5 for the projector and 2e-4 for the LM, using the AdamW optimizer for 50 epochs (1 epoch for ImageNet), with a warmup ratio of 0.03, a LoRA rank of 128, and a LoRA alpha of 256. For both settings, we report the best performance on the validation set after training. (A hedged configuration sketch follows the table.)
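
The abstract quoted in the 'Research Type' row reports a strong correlation between how often a class appears in VLM training and instruction-tuning data and the VLM's accuracy on that class. The snippet below is a minimal sketch of how such a correlation could be computed; the class names, counts, and accuracies are made-up placeholders, and the authors' actual frequency-extraction pipeline is not reproduced.

```python
# Hedged sketch: rank-correlate class frequency in the VLM's training text
# with the VLM's per-class test accuracy. All values below are made up for
# illustration only.
from scipy.stats import spearmanr

# class_frequency[c]: occurrences of class name c in training / instruction-tuning text
class_frequency = {"goldfish": 12000, "tench": 300, "mud turtle": 45}
# per_class_accuracy[c]: VLM classification accuracy on test images of class c
per_class_accuracy = {"goldfish": 0.92, "tench": 0.41, "mud turtle": 0.18}

classes = sorted(class_frequency)
freqs = [class_frequency[c] for c in classes]
accs = [per_class_accuracy[c] for c in classes]

# Spearman's rho tests a monotonic relationship, which is the natural choice
# given the heavy-tailed distribution of class frequencies.
rho, p_value = spearmanr(freqs, accs)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```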
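
The 'Dataset Splits' row links three of the four benchmarks to their TensorFlow Datasets catalog pages. Assuming those loaders are acceptable, the sketch below pulls the corresponding test splits; TFDS split definitions (and hence sizes) may differ slightly from the paper's Table 7, and ImageNet (imagenet2012 in TFDS) is omitted because it requires a manual download.

```python
# Hedged sketch: load the three TFDS-hosted benchmarks named in Table 7.
# Split sizes reported by tensorflow_datasets may not match the paper exactly.
import tensorflow_datasets as tfds

flowers = tfds.load("oxford_flowers102", split="test")  # Flowers102
cars = tfds.load("cars196", split="test")                # Stanford Cars
caltech = tfds.load("caltech101", split="test")          # Caltech101

# Each example is a dict with "image" and "label" features.
for example in flowers.take(1):
    image, label = example["image"], example["label"]
    print(image.shape, int(label))
```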
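
The 'Experiment Setup' row fully specifies the second fine-tuning setting (MLP projector plus LM via LoRA). The sketch below assembles that configuration with Hugging Face transformers and peft; the checkpoint name llava-hf/llava-1.5-7b-hf and the module names (multi_modal_projector, language_model, q/k/v/o_proj) come from the transformers LLaVA implementation and are assumptions, not the authors' released training code.

```python
# Hedged sketch of the second setting: LoRA on the language model plus full
# fine-tuning of the MLP projector. Hyperparameters mirror the quote above.
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)

# LoRA rank 128, alpha 256, applied only to the LM's attention projections.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=r"language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# peft freezes all non-LoRA weights; re-enable gradients for the projector,
# which is fine-tuned in full.
projector_params = [p for n, p in model.named_parameters() if "multi_modal_projector" in n]
for p in projector_params:
    p.requires_grad_(True)

# Separate learning rates: 2e-5 for the projector, 2e-4 for the LoRA weights.
lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
optimizer = torch.optim.AdamW([
    {"params": projector_params, "lr": 2e-5},
    {"params": lora_params, "lr": 2e-4},
])

# The training loop itself (batch size 64, 50 epochs or 1 for ImageNet, warmup
# ratio 0.03, best-on-validation checkpoint selection) is omitted here.
```

The first setting (projector only) would drop the LoRA wrapping and keep only the 2e-5 parameter group.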