Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Authors: Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di ZHANG, Long Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
Researcher Affiliation	Collaboration	Wei Chen¹, Xin Yan², Bin Wen³, Fan Yang³, Tingting Gao³, Di Zhang³, Long Chen¹. ¹HKUST, ²University of Waterloo, ³Kuaishou Technology
Pseudocode	Yes	Algorithm 1: Decoupling Contrastive Decoding
Open Source Code	Yes	Code is available at https://github.com/HKUST-LongGroup/DCD.
Open Datasets	Yes	We evaluated our approach on four widely-used hallucination preference datasets: RLHF-V [16] (human-annotated visual preferences), BPO [18] (data-augmented synthetic preference pairs), RLAIF-V [17] (AI-annotated preferences), and VLFeedback [19] (dense visual faithfulness annotations). For VLFeedback, we threshold responses using Visual Faithfulness scores (above four were considered positive, and those below two were considered negative), while others provide explicit preference pairs. Our method leverages these datasets for positive and negative projection learning. ... Hallucination Benchmarks: We used MM-Vet [34] (open-ended VQA), MMHal [32] (hallucination severity scoring), HallusionBench [33] (adversarial visual contradictions), and POPE [31] (object existence verification) to assess the hallucination. General Benchmarks: We selected SEEDBench [36] (multimodal understanding), MMStar [38] (complex VQA), and MMMU [37] (multi-discipline university-level problems) for general performance evaluation. These benchmarks provide comprehensive coverage of tasks for MLLMs. We also evaluated our method on MathVista [35] to assess the performance on mathematical visual reasoning.
Dataset Splits	No	The paper refers to using existing "hallucination preference datasets" for training and separate "evaluation benchmarks" for testing, but it does not specify explicit train/validation/test splits (e.g., percentages, sample counts, or explicit standard split names) for the preference datasets used in its own training procedure.
Hardware Specification	Yes	For training, we use the above four hallucination-related preference datasets: RLHF-V [16] is trained for 2 epochs, while the remaining datasets are trained for 1 epoch each on NVIDIA A100 80GB.
Software Dependencies	No	We conduct our experiments on LLaVA 1.5-7B [1], training only the image projection layer while keeping all other parameters frozen.
Experiment Setup	Yes	Implementation Details. We conduct our experiments on LLaVA 1.5-7B [1], training only the image projection layer while keeping all other parameters frozen. For training, we use the above four hallucination-related preference datasets: RLHF-V [16] is trained for 2 epochs, while the remaining datasets are trained for 1 epoch each on NVIDIA A100 80GB. Hyperparameters for contrastive decoding follow the configuration recommended in VCD [15], ensuring consistency with this baseline approach. For the DPO baseline, we follow the training setting of BPO [18].
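
Illustrative note on the Open Datasets row: the VLFeedback thresholding quoted there (Visual Faithfulness scores above four treated as positive, below two as negative) could be sketched roughly as follows. This is a minimal sketch and not the authors' code: the function names, the `(text, score)` input shape, and the choice to discard scores between the two thresholds are assumptions made for illustration, since the quoted passage does not say how intermediate or boundary scores are handled.

```python
from typing import List, Optional, Tuple

# Thresholds taken from the quoted passage; strict inequalities are an assumption.
POSITIVE_ABOVE = 4.0
NEGATIVE_BELOW = 2.0


def label_by_faithfulness(score: float) -> Optional[str]:
    """Map a Visual Faithfulness score to 'positive', 'negative', or None (hypothetical helper)."""
    if score > POSITIVE_ABOVE:
        return "positive"
    if score < NEGATIVE_BELOW:
        return "negative"
    # Assumption: scores in [2, 4] are left unlabeled and dropped.
    return None


def split_responses(scored: List[Tuple[str, float]]) -> Tuple[List[str], List[str]]:
    """Split (response_text, score) pairs into positive and negative lists."""
    positives: List[str] = []
    negatives: List[str] = []
    for text, score in scored:
        label = label_by_faithfulness(score)
        if label == "positive":
            positives.append(text)
        elif label == "negative":
            negatives.append(text)
    return positives, negatives
```

Under these assumptions, a response scored exactly 4 or exactly 2 falls into neither group; whether the paper treats boundary scores this way is not stated in the quoted text.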