Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Y. Zou, Kai-Wei Chang, Wei Wang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. |
| Researcher Affiliation | Academia | 1University of California, Los Angeles 2University of California, Berkeley 3Stanford University |
| Pseudocode | Yes | We summarize STIC in Algorithms 1 and 2, and detail the process below. |
| Open Source Code | Yes | Code and data are made publicly available on GitHub. |
| Open Datasets | Yes | For the self-constructed preference dataset, we gather 6k unlabeled image data randomly sampled from the MSCOCO dataset (Lin et al., 2014) and specifically the train2014 split for its high-quality images popularly used for pre-training and fine-tuning. |
| Dataset Splits | No | In the second stage, we randomly subsample 5k used instruction fine-tuning data from LLaVA’s SFT data to construct the description-infused fine-tuning data with model-generated image descriptions. |
| Hardware Specification | Yes | Experiments were conducted on NVIDIA RTX A6000 GPU clusters. |
| Software Dependencies | No | We use low-rank adaptation (LoRA) fine-tuning (Hu et al., 2021) instead of full fine-tuning for efficient computation. |
| Experiment Setup | Yes | We train for 1 epoch in each stage, including the image comprehension self-training stage and the description-infused fine-tuning stage. We use the same hyperparameters for LoRA fine-tuning in both stages, with lora_r = 128, lora_alpha = 256, and lora_target = all. The fine-tuning hyperparameters for Stage 1 are presented in Table 6. |
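
The values quoted in the Open Datasets and Experiment Setup rows can be collected into a small sketch for anyone attempting a rerun. This is an illustration only, not the authors' code: the `LoraSettings` class, the `subsample` helper, the seed, and the placeholder file names are all hypothetical, and the quoted text reports no seed or split files. Only `lora_r`, `lora_alpha`, `lora_target`, the one-epoch setting, and the 6k sample size come from the table above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """LoRA values quoted in the Experiment Setup row (same in both stages)."""
    lora_r: int = 128
    lora_alpha: int = 256
    lora_target: str = "all"
    num_epochs: int = 1  # one epoch per stage, as quoted

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


def subsample(items, k, seed=0):
    """Draw k items uniformly at random, without replacement.

    The seed is a hypothetical choice; the quoted text reports none.
    """
    return random.Random(seed).sample(list(items), k)


settings = LoraSettings()

# Hypothetical file names standing in for the MSCOCO train2014 image pool.
pool = [f"image_{i:06d}.jpg" for i in range(20000)]

# 6k unlabeled images for the self-constructed preference dataset.
stage1_images = subsample(pool, 6000)
```

Fixing a seed makes the subsample repeatable, which matters here because the Dataset Splits row is marked "No": without the published code or a recorded seed, the exact 6k and 5k subsets cannot be recovered from the description alone.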