Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

Authors: Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive evaluation across five benchmarks demonstrate SymMPO's superior performance, validating its effectiveness in hallucination mitigation of MLLMs. Our codes are available at https://github.com/Liuwq-bit/SymMPO.
Researcher Affiliation	Academia	(1) Shandong University; (2) Southern University of Science and Technology; (3) University of Georgia; (4) National University of Singapore; (5) Harbin Institute of Technology (Shenzhen)
Pseudocode	No	The paper only describes methods and pipelines, such as the "Caption-Anchored Claim Extraction-and-Rewriting pipeline" in Section 4.1 and illustrated in Figure 2, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our codes are available at https://github.com/Liuwq-bit/SymMPO.
Open Datasets	Yes	In this work, we adopt the same set of 21.4k image-prompt pairs from TPO [26], which aggregates multiple public datasets, including VQA v2 [27], MSCOCO [28] and TextVQA [29].
Dataset Splits	No	The paper states, "In this work, we adopt the same set of 21.4k image-prompt pairs from TPO [26], which aggregates multiple public datasets, including VQA v2 [27], MSCOCO [28] and TextVQA [29]." It then describes how preference data is constructed from these pairs. However, it does not explicitly specify how this set of 21.4k image-prompt pairs is split into training, validation, or test sets for the model's training process. Evaluation is performed on separate established benchmarks.
Hardware Specification	Yes	The training is performed on 4 NVIDIA A100-40GB GPUs.
Software Dependencies	No	The paper mentions several models and frameworks used (e.g., LLaVA-1.5, GPT-4, DeepSeek-V3, Qwen2.5-VL-32B, CLIP, FLUX.1-dev) but does not provide specific version numbers for general software dependencies such as programming languages, deep learning frameworks, or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	All models (DPO, mDPO, and SymMPO) for rigorous comparison are trained for 2 epochs with a learning rate of 5e-6 and batch size of 64, using the following hyper-parameters: β = 0.1, δ = 0, λ = 0.5, γ = 1e-4, and η = 1.0.
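
For anyone attempting a re-run, the reported setup can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `TrainConfig`, its field names, and the helper `per_device_batch_size` are hypothetical, γ is read here as 1e-4, and the helper assumes that the batch size of 64 is a global value divided evenly across the 4 reported GPUs, which the quoted text does not confirm.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup and Hardware rows.

    Field names are illustrative; the Greek symbols are kept as plain
    names because their roles are not described in this table.
    """
    epochs: int = 2
    learning_rate: float = 5e-6
    batch_size: int = 64      # assumed to be the global batch size
    beta: float = 0.1
    delta: float = 0.0
    lam: float = 0.5          # lambda (renamed: `lambda` is a Python keyword)
    gamma: float = 1e-4       # assumption: read as 1e-4
    eta: float = 1.0
    num_gpus: int = 4         # 4 NVIDIA A100-40GB GPUs


def per_device_batch_size(cfg: TrainConfig, grad_accum_steps: int = 1) -> int:
    """Split the global batch evenly over GPUs and accumulation steps.

    Assumes plain data parallelism; raises if the split is not exact.
    """
    per_step, remainder = divmod(cfg.batch_size, cfg.num_gpus * grad_accum_steps)
    if remainder != 0 or per_step == 0:
        raise ValueError("batch size does not divide evenly across devices")
    return per_step


if __name__ == "__main__":
    cfg = TrainConfig()
    print(asdict(cfg))
    print(per_device_batch_size(cfg))
```

Under these assumptions each GPU would process 16 samples per optimizer step; raising `grad_accum_steps` shows how the same global batch could fit in less memory per step.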