Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Y. Zou, Kai-Wei Chang, Wei Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method.
Researcher Affiliation | Academia | 1 University of California, Los Angeles; 2 University of California, Berkeley; 3 Stanford University
Pseudocode | Yes | We summarize STIC in Algorithms 1 and 2, and detail the process below.
Open Source Code | Yes | Code and data are made publicly available on GitHub.
Open Datasets | Yes | For the self-constructed preference dataset, we gather 6k unlabeled image data randomly sampled from the MSCOCO dataset (Lin et al., 2014) and specifically the train2014 split for its high-quality images popularly used for pre-training and fine-tuning.
Dataset Splits | No | In the second stage, we randomly subsample 5k used instruction fine-tuning data from LLaVA's SFT data to construct the description-infused fine-tuning data with model-generated image descriptions.
Hardware Specification | Yes | Experiments were conducted on NVIDIA RTX A6000 GPU clusters.
Software Dependencies | No | We use low-rank adaptation (LoRA) fine-tuning (Hu et al., 2021) instead of full fine-tuning for efficient computation.
Experiment Setup | Yes | We train for 1 epoch in each stage, including the image comprehension self-training stage and the description-infused fine-tuning stage. We use the same hyperparameters for LoRA fine-tuning in both stages, with lora_r = 128, lora_alpha = 256, and lora_target = all. The fine-tuning hyperparameters for Stage 1 are presented in Table 6.
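
The quoted setup can be collected into one small configuration record. This is only an illustrative sketch, not the authors' code. The class name `LoraStageConfig`, the stage labels, the helper `subsample`, and the seed value are all hypothetical. Only the values 1 epoch, lora_r = 128, lora_alpha = 256, and lora_target = all come from the quoted text. The subsampling helper shows one way to draw a fixed-size random subset reproducibly, such as the 6k images or 5k fine-tuning examples mentioned in the table. The paper excerpt does not state how the sampling was seeded.

```python
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoraStageConfig:
    """Hypothetical container; the defaults mirror the quoted hyperparameters."""

    stage: str
    num_epochs: int = 1        # "We train for 1 epoch in each stage"
    lora_r: int = 128          # LoRA rank
    lora_alpha: int = 256      # LoRA scaling numerator
    lora_target: str = "all"   # apply LoRA to all target modules


# Both stages share the same LoRA hyperparameters, per the quoted response.
# The stage labels are illustrative names, not identifiers from the paper.
STAGES = (
    LoraStageConfig(stage="image_comprehension_self_training"),
    LoraStageConfig(stage="description_infused_fine_tuning"),
)


def subsample(items, k, seed=0):
    """Draw k items without replacement.

    The seed is an assumption for reproducibility; it is not from the paper.
    """
    return random.Random(seed).sample(list(items), k)


if __name__ == "__main__":
    for cfg in STAGES:
        print(asdict(cfg))
    # Placeholder IDs stand in for real image identifiers (hypothetical).
    image_ids = [f"img_{i:06d}" for i in range(80_000)]
    print(len(subsample(image_ids, 6_000)))
```

Fixing the seed makes the drawn subset repeatable across runs, which is the kind of detail the "Dataset Splits" row marks as unreported.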