Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Authors: Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li, Zeyu Tang, Kun Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. |
| Researcher Affiliation | Collaboration | ¹Mohamed bin Zayed University of Artificial Intelligence ²Carnegie Mellon University ³ByteDance US |
| Pseudocode | Yes | Algorithm A1 Iterative Self-Refinement |
| Open Source Code | Yes | Code is available at SRF-LLaVA. |
| Open Datasets | Yes | The dataset, comprising 2 million images accompanied by generated instructions, will be publicly released to facilitate further research. ... we randomly selected 2.8 million images from LAION [36] as the unlabeled image set ... VQAv2 [30], GQA [31], and ScienceQA [32] datasets. For visual perception and reasoning tasks, we employed MMBench [29], MMBench-Chinese [29], MME [37], and MM-Vet [28] benchmarks. To assess visual dialogue ability, we utilized the LLaVA-Bench (In-the-Wild) [1] benchmark. |
| Dataset Splits | Yes | In the evaluation, we followed the setup of LLaVA-1.5, which contains 8 benchmarks. For traditional VQA tasks, we used the VQAv2 [30], GQA [31], and ScienceQA [32] datasets. For visual perception and reasoning tasks, we employed MMBench [29], MMBench-Chinese [29], MME [37], and MM-Vet [28] benchmarks. To assess visual dialogue ability, we utilized the LLaVA-Bench (In-the-Wild) [1] benchmark. |
| Hardware Specification | Yes | Table 2 compares the total wall-clock times of SRF-LLaVA-1.5 and LLaVA-1.5, both trained on a cluster equipped with 8 NVIDIA H100-NVL GPUs (96 GB each). |
| Software Dependencies | No | For shorter texts, we employed UAE-Large-V1 [22], a compact Sentence Transformer with only 335 million parameters, that effectively compares single-sentence text similarity without introducing external knowledge. For longer texts, since the performance of the Sentence Transformer diminishes, we opt for BertScore [23], based on the bert-base-uncased model. |
| Experiment Setup | Yes | The training hyperparameters were consistent with those of LLaVA-1.5: an initial learning rate of $2 \times 10^{-5}$ and a batch size of 128. |
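
The Software Dependencies row quotes a length-based choice of similarity metric: a sentence-embedding model for shorter texts and BertScore for longer ones. A minimal sketch of that routing follows. It is not the authors' implementation: the names `make_router` and `token_jaccard` and the `max_short_tokens` cutoff are hypothetical, the quoted text gives no length threshold, and a toy token-overlap scorer stands in for both real models so the example runs without downloads.

```python
from typing import Callable

# A scorer maps (candidate, reference) to a similarity value.
Scorer = Callable[[str, str], float]


def token_jaccard(a: str, b: str) -> float:
    """Toy stand-in scorer: Jaccard overlap of lowercased whitespace tokens.

    This is NOT UAE-Large-V1 or BertScore; it only keeps the sketch runnable.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def make_router(short_scorer: Scorer, long_scorer: Scorer,
                max_short_tokens: int = 32) -> Scorer:
    """Return a scorer that dispatches on text length.

    max_short_tokens is an assumed cutoff; the summary above does not
    state where "shorter" ends and "longer" begins.
    """
    def similarity(candidate: str, reference: str) -> float:
        # Use the longer of the two texts to decide which scorer applies.
        n_tokens = max(len(candidate.split()), len(reference.split()))
        scorer = short_scorer if n_tokens <= max_short_tokens else long_scorer
        return scorer(candidate, reference)

    return similarity


if __name__ == "__main__":
    # In practice the two arguments would wrap an embedding-cosine scorer
    # and a BertScore scorer; here both are the toy stand-in.
    similarity = make_router(token_jaccard, token_jaccard)
    print(similarity("a red car on the road", "a red car on a road"))
```

Passing the scorers in as arguments keeps the length-based dispatch separate from any particular model library, so either metric can be swapped without touching the routing logic.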