Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Authors: Yi Ding, Ruqi Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
Researcher Affiliation | Academia | Yi Ding, Ruqi Zhang, Department of Computer Science, Purdue University, USA. EMAIL
Pseudocode | Yes | Algorithm 1: Sherlock: Self-correction and Self-improvement training framework
Open Source Code | Yes | Justification: We provide our code in the supplementary materials and GitHub link.
Open Datasets | Yes | We evaluate on eight challenging multimodal benchmarks, including VQA (MMBench-V1.1 [20], MMVet [50], MME [17], MMStar [4]), math and science (MathVista [21], AI2D [15], MMMU [51]), and hallucination (HallusionBench [10]).
Dataset Splits | Yes | We randomly sample two sets of 10k annotated examples from the LLaVA-CoT dataset, denoted as D_A and D_B. ... In the offline preference training stage, we construct trajectory-level preference data using the 10k examples in D_A. ... In each iteration, we randomly sample 5k unlabeled questions.
Hardware Specification | Yes | We compare this approach with parallel majority voting on MathVista [21], reporting accuracy and inference time (A100 GPU hours) in Table 5.
Software Dependencies | No | The paper mentions building on Llama3.2-Vision-11B-Instruct [9] but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 6: Detailed training hyperparameters for each stage of the Sherlock model. Columns: Learning Rate, Max Length, Batch Size, α, β_DPO, Warm-Up Ratio, Epoch. SFT: 1e-6, 4096, 128, -, -, 0.03, 3. Offline: 5e-6, 4096, 32, 0.25, 0.1, 0.00, 1. Iter1: 5e-7, 4096, 32, 0.25, 0.1, 0.00, 1. Iter2: 5e-7, 4096, 32, 0.25, 0.1, 0.00, 1.
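The Table 6 values quoted in the Experiment Setup row can be written down as a small configuration object, which may make the per-stage differences easier to compare. This is only a sketch: the names `StageConfig`, `SHERLOCK_STAGES`, and `stages_with_dpo_terms` are hypothetical and do not come from the paper or its code, and the assignment of the five SFT values to columns (no α or β_DPO for SFT) is an assumption based on reading the flattened table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    learning_rate: float
    max_length: int
    batch_size: int
    warmup_ratio: float
    epochs: int
    # Assumption: the SFT row lists no alpha / beta_DPO values, so they are optional.
    alpha: Optional[float] = None
    beta_dpo: Optional[float] = None


# Values transcribed from the Table 6 row quoted above.
SHERLOCK_STAGES: Dict[str, StageConfig] = {
    "SFT": StageConfig(1e-6, 4096, 128, 0.03, 3),
    "Offline": StageConfig(5e-6, 4096, 32, 0.00, 1, alpha=0.25, beta_dpo=0.1),
    "Iter1": StageConfig(5e-7, 4096, 32, 0.00, 1, alpha=0.25, beta_dpo=0.1),
    "Iter2": StageConfig(5e-7, 4096, 32, 0.00, 1, alpha=0.25, beta_dpo=0.1),
}


def stages_with_dpo_terms(stages: Dict[str, StageConfig]) -> List[str]:
    """Return the names of stages that list a beta_DPO value, in table order."""
    return [name for name, cfg in stages.items() if cfg.beta_dpo is not None]
```

Calling `stages_with_dpo_terms(SHERLOCK_STAGES)` lists the stages that report α and β_DPO, which under the reading above are the offline stage and the two iterations.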