Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Authors: Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge, Joseph E Gonzalez, Trevor Darrell, David Chan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 34% on HaloQuest. ... We evaluate REVERSE against SOTA hallucination reduction baselines across a wide range of benchmarks designed for hallucination evaluation on LLaVA-v1.5 [34], LLaVA-MORE [17], and Qwen2.5-VL [6]. On captioning tasks, REVERSE achieves up to a 12% reduction in CHAIR scores on CHAIR-MSCOCO [40] and AMBER [47] over the best existing methods. On hallucination-sensitive open-ended tasks, it also delivers over a 10% and 34% performance improvement on MMHal [42] and HaloQuest [49], respectively. ... REVERSE significantly reduces hallucination with minimal loss in expressiveness and efficiency (shown in Table 1, Table 2).
Researcher Affiliation	Academia	Tsung-Han Wu1 Heekyung Lee1,2 Jiaxin Ge1 Joseph E. Gonzalez1 Trevor Darrell1 David M. Chan1 1UC Berkeley 2POSTECH
Pseudocode	Yes	Algorithm 1 On-the-Fly Retrospective Resampling During Generation
Open Source Code	Yes	The code for REVERSE is released under the MIT license at https://github.com/tsunghan-wu/reverse_vlm.
Open Datasets	Yes	We also release model checkpoints of REVERSE (based on LLaVA-v1.5, LLaVA-MORE, and Qwen2.5-VL), along with a 1.3M-sample semi-synthetic dataset, at Hugging Face. Both the checkpoints and dataset are released under the MIT license. ... CHAIR-MSCOCO [40], AMBER [47], MMHal [42], and HaloQuest [49]
Dataset Splits	No	The paper mentions evaluating on specific benchmarks like 'CHAIR-MSCOCO [40]', 'AMBER [47]', 'MMHal [42]', and 'HaloQuest [49]', and describes using 'the full MSCOCO validation set' or a 'subset of 500 captions' for CHAIR-MSCOCO. For training, it uses a '1.3M VLM instruction-tuning dataset' and states 'finetune...on the same 100k subset used for LLaVA’s instruction data and a matched subset from our dataset' for Qwen2.5-VL. However, it does not explicitly provide detailed train/validation/test splits for its own 1.3M semi-synthetic dataset, nor specific splitting methodologies for all experiments.
Hardware Specification	Yes	Training takes 24 hours for LLaVA-v1.5-7B and 36 hours for LLaVA-MORE on 8 A100 80GB GPUs using DeepSpeed ZeRO-2. ... Training takes 3 hours on 4 A100 80GB GPUs using DeepSpeed ZeRO-3.
Software Dependencies	No	The paper mentions 'DeepSpeed ZeRO-2' and 'DeepSpeed ZeRO-3' for distributed training, and references 'spacy' [22] and 'NLTK' in the text. However, it does not provide specific version numbers for these software components or other key libraries (like Python, PyTorch, CUDA, etc.) used in their implementation.
Experiment Setup	Yes	For both LLaVA-v1.5-7B and LLaVA-MORE, we initialize from pretrained language models (Vicuna1.5-7B and Llama-3.1-8B-Instruct, respectively), along with their corresponding visual projectors and CLIP-ViT-L/14-336 vision encoders. Following the standard LLaVA setup, we perform LoRA fine-tuning on the pre-trained model directly with our modified cross-entropy loss (see subsection 3.2) and the 1.3M-sample dataset for one epoch with LoRA (rank = 128, α = 256). We adopt the modified cross-entropy loss defined in subsection 3.2 and train using the AdamW optimizer. The learning rate is set to 2e-5 for the visual projector, and (2e-4, 1e-4) for the LoRA parameters of LLaVA-v1.5-7B and LLaVA-MORE, respectively. The CLIP backbone is kept frozen. We use a global batch size of 128 with no gradient accumulation. ... We use the AdamW optimizer with a learning rate of 5e-5, freeze the CLIP encoder, and set the batch size to 128 with no gradient accumulation. ... During inference, we apply retrospective resampling with different threshold values: τ = 0.003 for LLaVA-series models and τ = 0.01 for Qwen2.5-VL. ... For the correction mechanism, we allow up to N = 50 total correction attempts, with local correction attempts of K = 10. Additionally, we implement rejection sampling with a base temperature of T0, gradually increasing it with a step size of ΔT = 0.1, capped at a maximum temperature of T0 + 0.5: T = min(T + ΔT, T0 + 0.5).
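
The temperature rule quoted in the Experiment Setup row, T = min(T + ΔT, T0 + 0.5), can be illustrated with a short sketch. This is not the authors' implementation: the function name `temperature_schedule`, its parameter names, and the base temperature used in the example are hypothetical, and the sketch assumes the first attempt samples at T0 and the increase is applied before each later retry.

```python
def temperature_schedule(t0, attempts, step=0.1, max_increase=0.5):
    """List the sampling temperature for each successive resampling attempt.

    Assumption (not stated in the excerpt): the first attempt uses the
    base temperature t0, and the temperature is raised before each retry.
    """
    cap = t0 + max_increase          # maximum temperature, T0 + 0.5
    temps = []
    t = t0
    for _ in range(attempts):
        temps.append(t)
        t = min(t + step, cap)       # T = min(T + ΔT, T0 + 0.5)
    return temps


if __name__ == "__main__":
    # Hypothetical base temperature; the excerpt does not give T0's value.
    for i, t in enumerate(temperature_schedule(t0=1.0, attempts=8)):
        print(f"attempt {i}: T = {t:.1f}")
```

With ΔT = 0.1 and a cap of T0 + 0.5, the schedule reaches its maximum after five increases and stays there for any remaining attempts.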