Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Authors: Yi Ding, Ruqi Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
Researcher Affiliation	Academia	Yi Ding, Ruqi Zhang Department of Computer Science, Purdue University, USA EMAIL
Pseudocode	Yes	Algorithm 1: Sherlock: Self-correction and Self-improvement training framework
Open Source Code	Yes	Justification: We provide our code in the supplementary materials and github link.
Open Datasets	Yes	We evaluate on eight challenging multimodal benchmarks, including VQA (MMBench-V1.1 [20], MMVet [50], MME [17], MMStar [4]), math and science (MathVista [21], AI2D [15], MMMU [51]), and hallucination (HallusionBench [10]).
Dataset Splits	Yes	We randomly sample two sets of 10k annotated examples from the LLaVA-CoT dataset, denoted as DA and DB. ... In the offline preference training stage, we construct trajectory-level preference data using the 10k examples in DA. ... In each iteration, we randomly sample 5k unlabeled questions.
Hardware Specification	Yes	We compare this approach with parallel majority voting on MathVista [21], reporting accuracy and inference time (A100 GPU hours) in Table 5.
Software Dependencies	No	The paper mentions building on Llama3.2-Vision-11B-Instruct [9] but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA.
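Experiment Setup	Yes	Table 6: Detailed training hyperparameters for each stage of the Sherlock model. SFT: learning rate 1e-6, max length 4096, batch size 128, warm-up ratio 0.03, epochs 3. Offline: learning rate 5e-6, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, epochs 1. Iter1: learning rate 5e-7, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, epochs 1. Iter2: learning rate 5e-7, max length 4096, batch size 32, α 0.25, β_DPO 0.1, warm-up ratio 0.00, epochs 1.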
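
For anyone attempting a reproduction, the Table 6 values quoted in the Experiment Setup row can be written down as a small machine-readable config. The following is only a sketch and not the authors' code: the class name `StageConfig`, its field names, and the helper `preference_stages` are hypothetical. It also assumes that the five values in the SFT row map to learning rate, max length, batch size, warm-up ratio, and epoch, with α and β_DPO left unset for that stage. This is inferred from the flattened table rather than stated in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one row of Table 6 (names are illustrative)."""
    learning_rate: float
    max_length: int
    batch_size: int
    warmup_ratio: float
    epochs: int
    # Assumption: alpha and beta_dpo are only listed for the preference stages.
    alpha: Optional[float] = None
    beta_dpo: Optional[float] = None


# Values transcribed from the Table 6 excerpt quoted above.
STAGES: Dict[str, StageConfig] = {
    "SFT": StageConfig(1e-6, 4096, 128, warmup_ratio=0.03, epochs=3),
    "Offline": StageConfig(5e-6, 4096, 32, warmup_ratio=0.00, epochs=1,
                           alpha=0.25, beta_dpo=0.1),
    "Iter1": StageConfig(5e-7, 4096, 32, warmup_ratio=0.00, epochs=1,
                         alpha=0.25, beta_dpo=0.1),
    "Iter2": StageConfig(5e-7, 4096, 32, warmup_ratio=0.00, epochs=1,
                         alpha=0.25, beta_dpo=0.1),
}


def preference_stages(stages: Dict[str, StageConfig]) -> List[str]:
    """Return the names of stages that carry a beta_dpo value, in table order."""
    return [name for name, cfg in stages.items() if cfg.beta_dpo is not None]


if __name__ == "__main__":
    for name, cfg in STAGES.items():
        print(name, cfg)
    print("Preference stages:", preference_stages(STAGES))
```

Keeping the stages in one dictionary makes it easy to compare a reproduction run's settings against the reported ones stage by stage. Software versions are not part of this sketch, since the Software Dependencies row notes that the paper does not report them.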