Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Y. Zou, Kai-Wei Chang, Wei Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method.
Researcher Affiliation | Academia | (1) University of California, Los Angeles; (2) University of California, Berkeley; (3) Stanford University
Pseudocode | Yes | We summarize STIC in Algorithms 1 and 2, and detail the process below.
Open Source Code | Yes | Code and data are made publicly available on GitHub.
Open Datasets | Yes | For the self-constructed preference dataset, we gather 6k unlabeled image data randomly sampled from the MSCOCO dataset (Lin et al., 2014) and specifically the train2014 split for its high-quality images popularly used for pre-training and fine-tuning. (See the sampling sketch after this table.)
Dataset Splits | No | In the second stage, we randomly subsample 5k used instruction fine-tuning data from LLaVA's SFT data to construct the description-infused fine-tuning data with model-generated image descriptions.
Hardware Specification | Yes | Experiments were conducted on NVIDIA RTX A6000 GPU clusters.
Software Dependencies | No | We use low-rank adaptation (LoRA) fine-tuning (Hu et al., 2021) instead of full fine-tuning for efficient computation.
Experiment Setup | Yes | We train for 1 epoch in each stage, including the image comprehension self-training stage and the description-infused fine-tuning stage. We use the same hyperparameters for LoRA fine-tuning in both stages, with lora_r = 128, lora_alpha = 256, and lora_target = all. The fine-tuning hyperparameters for Stage 1 are presented in Table 6. (See the LoRA configuration sketch below.)