Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Towards Self-Refinement of Vision-Language Models with Triangular Consistency

Authors: Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li, Zeyu Tang, Kun Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. |
| Researcher Affiliation | Collaboration | 1 Mohamed bin Zayed University of Artificial Intelligence, 2 Carnegie Mellon University, 3 ByteDance US |
| Pseudocode | Yes | Algorithm A1 Iterative Self-Refinement |
| Open Source Code | Yes | Code is available at SRF-LLaVA. |
| Open Datasets | Yes | The dataset, comprising 2 million images accompanied by generated instructions, will be publicly released to facilitate further research. ... we randomly selected 2.8 million images from LAION [36] as the unlabeled image set ... VQAv2 [30], GQA [31], and ScienceQA [32] datasets. For visual perception and reasoning tasks, we employed MMBench [29], MMBench-Chinese [29], MME [37], and MM-Vet [28] benchmarks. To assess visual dialogue ability, we utilized the LLaVA-Bench (In-the-Wild) [1] benchmark. |
| Dataset Splits | Yes | In the evaluation, we followed the setup of LLaVA-1.5, which contains 8 benchmarks. For traditional VQA tasks, we used the VQAv2 [30], GQA [31], and ScienceQA [32] datasets. For visual perception and reasoning tasks, we employed MMBench [29], MMBench-Chinese [29], MME [37], and MM-Vet [28] benchmarks. To assess visual dialogue ability, we utilized the LLaVA-Bench (In-the-Wild) [1] benchmark. |
| Hardware Specification | Yes | Table 2 compares the total wall-clock times of SRF-LLaVA-1.5 and LLaVA-1.5, both trained on a cluster equipped with 8 NVIDIA H100-NVL GPUs (96 GB each). |
| Software Dependencies | No | For shorter texts, we employed UAE-Large-V1 [22], a compact Sentence Transformer with only 335 million parameters, that effectively compares single-sentence text similarity without introducing external knowledge. For longer texts, since the performance of the Sentence Transformer diminishes, we opt for BERTScore [23], based on the bert-base-uncased model. |
| Experiment Setup | Yes | The training hyperparameters were consistent with those of LLaVA-1.5: an initial learning rate of 2 × 10^-5 and a batch size of 128. |
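
The reported experiment setup and hardware can be collected into a small configuration record. The sketch below is illustrative only: the class name `ReportedSetup`, its field names, and the `per_gpu_batch_size` helper are hypothetical and not taken from the paper or its code. The per-GPU figure assumes the batch size of 128 is a global batch split evenly across the 8 GPUs, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; names are hypothetical."""

    learning_rate: float = 2e-5   # "initial learning rate of 2 x 10^-5"
    global_batch_size: int = 128  # "a batch size of 128" (assumed to be global)
    num_gpus: int = 8             # "8 NVIDIA H100-NVL GPUs"
    gpu_memory_gb: int = 96       # "(96 GB each)"

    def per_gpu_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Per-device micro-batch size under an even split (an assumption)."""
        divisor = self.num_gpus * grad_accum_steps
        if self.global_batch_size % divisor != 0:
            raise ValueError("global batch size is not evenly divisible")
        return self.global_batch_size // divisor


setup = ReportedSetup()
print(setup.per_gpu_batch_size())   # micro-batch with no gradient accumulation
print(setup.per_gpu_batch_size(2))  # micro-batch with 2 accumulation steps
```

Under these assumptions each GPU would process 16 samples per step without gradient accumulation, or 8 with two accumulation steps; the actual split used by the authors may differ.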