Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Authors: Yi Ding, Ruqi Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data. |
| Researcher Affiliation | Academia | Yi Ding, Ruqi Zhang Department of Computer Science, Purdue University, USA EMAIL |
| Pseudocode | Yes | Algorithm 1: Sherlock: Self-correction and Self-improvement training framework |
| Open Source Code | Yes | Justification: We provide our code in the supplementary materials and github link. |
| Open Datasets | Yes | We evaluate on eight challenging multimodal benchmarks, including VQA (MMBench-V1.1 [20], MMVet [50], MME [17], MMStar [4]), math and science (MathVista [21], AI2D [15], MMMU [51]), and hallucination (HallusionBench [10]). |
| Dataset Splits | Yes | We randomly sample two sets of 10k annotated examples from the LLaVA-CoT dataset, denoted as D_A and D_B. ... In the offline preference training stage, we construct trajectory-level preference data using the 10k examples in D_A. ... In each iteration, we randomly sample 5k unlabeled questions. |
| Hardware Specification | Yes | We compare this approach with parallel majority voting on MathVista [21], reporting accuracy and inference time (A100 GPU hours) in Table 5. |
| Software Dependencies | No | The paper mentions building on Llama3.2-Vision-11B-Instruct [9] but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA. |
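| Experiment Setup | Yes | Table 6: Detailed training hyperparameters for each stage of the Sherlock model. SFT: learning rate 1e-6, max length 4096, batch size 128, warm-up ratio 0.03, 3 epochs. Offline: learning rate 5e-6, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, 1 epoch. Iter1: learning rate 5e-7, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, 1 epoch. Iter2: learning rate 5e-7, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, 1 epoch. |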
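
The Dataset Splits row describes two randomly sampled sets of 10k annotated examples and 5k unlabeled questions per iteration. A minimal sketch of such a sampling scheme follows. It is not the authors' code: the function name `sample_splits`, the pool size, the seed, the choice to make the two annotated sets disjoint, and the choice to draw the unlabeled questions from the remainder of the same pool are all assumptions made for illustration. The two iterations mirror the Iter1 and Iter2 stages listed under Experiment Setup.

```python
import random


def sample_splits(num_examples, annotated_size=10_000, unlabeled_size=5_000,
                  iterations=2, seed=0):
    """Return index lists for two annotated sets and per-iteration unlabeled batches.

    Assumptions (not stated in the source): the two annotated sets are
    disjoint, and unlabeled questions come from the rest of the same pool.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    indices = list(range(num_examples))
    rng.shuffle(indices)

    # Two annotated subsets of equal size, taken from the shuffled pool.
    d_a = indices[:annotated_size]
    d_b = indices[annotated_size:2 * annotated_size]

    # Everything else is treated as the unlabeled question pool.
    remaining = indices[2 * annotated_size:]

    # One independent draw of unlabeled questions per training iteration.
    unlabeled = [rng.sample(remaining, unlabeled_size) for _ in range(iterations)]
    return d_a, d_b, unlabeled


# Hypothetical pool size, chosen only so the example runs.
d_a, d_b, unlabeled_batches = sample_splits(num_examples=100_000)
print(len(d_a), len(d_b), [len(batch) for batch in unlabeled_batches])
```

Returning indices rather than examples keeps the sketch independent of how the dataset is stored; the index lists can be applied to any indexable collection.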