Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Y. Zou, Kai-Wei Chang, Wei Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. |
| Researcher Affiliation | Academia | 1 University of California, Los Angeles; 2 University of California, Berkeley; 3 Stanford University |
| Pseudocode | Yes | We summarize STIC in Algorithms 1 and 2, and detail the process below. |
| Open Source Code | Yes | Code and data are made publicly available on GitHub. |
| Open Datasets | Yes | For the self-constructed preference dataset, we gather 6k unlabeled image data randomly sampled from the MSCOCO dataset (Lin et al., 2014) and specifically the train2014 split for its high-quality images popularly used for pre-training and fine-tuning. |
| Dataset Splits | No | In the second stage, we randomly subsample 5k used instruction fine-tuning data from LLaVA’s SFT data to construct the description-infused fine-tuning data with model-generated image descriptions. |
| Hardware Specification | Yes | Experiments were conducted on NVIDIA RTX A6000 GPU clusters. |
| Software Dependencies | No | We use low-rank adaptation (LoRA) fine-tuning (Hu et al., 2021) instead of full fine-tuning for efficient computation. |
| Experiment Setup | Yes | We train for 1 epoch in each stage, including the image comprehension self-training stage and the description-infused fine-tuning stage. We use the same hyperparameters for LoRA fine-tuning in both stages, with lora_r = 128, lora_alpha = 256, and lora_target = all. The fine-tuning hyperparameters for Stage 1 are presented in Table 6. |
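The Open Datasets and Dataset Splits rows describe the data construction: 6k unlabeled images sampled from MSCOCO train2014 for the self-constructed preference data, and 5k examples subsampled from LLaVA's SFT data for the description-infused fine-tuning stage. The snippet below is a minimal sketch of that sampling step, not code from the paper's repository; the local paths, the SFT file name, and the random seed are assumptions for illustration.

```python
import json
import random
from pathlib import Path

random.seed(0)  # assumed seed; not reported in the excerpts above

# 6k unlabeled MSCOCO train2014 images for the self-constructed preference data
coco_dir = Path("data/coco/train2014")            # assumed local path
images = sorted(coco_dir.glob("*.jpg"))
preference_images = random.sample(images, 6000)

# 5k examples subsampled from LLaVA's SFT data for description-infused fine-tuning
with open("data/llava_v1_5_mix665k.json") as f:   # assumed LLaVA SFT file name
    sft_data = json.load(f)
sft_subset = random.sample(sft_data, 5000)
```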
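The Experiment Setup row reports the LoRA hyperparameters used in both training stages (lora_r = 128, lora_alpha = 256, lora_target = all). The sketch below expresses those values as a Hugging Face `peft` configuration; it is an approximation rather than the authors' training script, and the base checkpoint, the `"all-linear"` stand-in for `lora_target = all`, and the plain-LLM loading path are assumptions (STIC fine-tunes an LVLM, whose loading differs).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_cfg = LoraConfig(
    r=128,                        # lora_r reported in the paper
    lora_alpha=256,               # lora_alpha reported in the paper
    target_modules="all-linear",  # rough stand-in for lora_target = all
    task_type="CAUSAL_LM",
)

# Hypothetical base model used only to make the sketch runnable.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```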