Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization
Authors: Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluation across five benchmarks demonstrate SymMPO's superior performance, validating its effectiveness in hallucination mitigation of MLLMs. Our codes are available at https://github.com/Liuwq-bit/SymMPO. |
| Researcher Affiliation | Academia | 1 Shandong University; 2 Southern University of Science and Technology; 3 University of Georgia; 4 National University of Singapore; 5 Harbin Institute of Technology (Shenzhen) |
| Pseudocode | No | The paper only describes methods and pipelines, such as the "Caption-Anchored Claim Extraction-and-Rewriting pipeline" in Section 4.1 and illustrated in Figure 2, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are available at https://github.com/Liuwq-bit/SymMPO. |
| Open Datasets | Yes | In this work, we adopt the same set of 21.4k image-prompt pairs from TPO [26], which aggregates multiple public datasets, including VQA v2 [27], MSCOCO [28] and TextVQA [29]. |
| Dataset Splits | No | The paper states, "In this work, we adopt the same set of 21.4k image-prompt pairs from TPO [26], which aggregates multiple public datasets, including VQA v2 [27], MSCOCO [28] and TextVQA [29]." It then describes how preference data is constructed from these pairs. However, it does not explicitly specify how this set of 21.4k image-prompt pairs is split into training, validation, or test sets for the model's training process. Evaluation is performed on separate, established benchmarks. |
| Hardware Specification | Yes | The training is performed on 4 NVIDIA A100-40GB GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks used (e.g., LLaVA-1.5, GPT-4, DeepSeek-V3, Qwen2.5-VL-32B, CLIP, FLUX.1-dev) but does not provide specific version numbers for general software dependencies such as programming languages, deep learning frameworks, or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All models (DPO, mDPO, and SymMPO) for rigorous comparison are trained for 2 epochs with a learning rate of 5e-6 and batch size of 64, using the following hyper-parameters: β = 0.1, δ = 0, λ = 0.5, γ = 1e-4, and η = 1.0. |
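
The reported setup can be collected into a small configuration object, which may help when checking a reproduction attempt against the table. The sketch below is illustrative only and is not taken from the authors' repository. The class name `TrainConfig`, its field names, and the helper `grad_accum_steps` are hypothetical. γ is taken to be 1e-4. The per-device micro-batch size is not reported in the table, so it is left as an argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values reported in the table above."""

    epochs: int = 2
    learning_rate: float = 5e-6
    global_batch_size: int = 64
    num_gpus: int = 4          # 4 NVIDIA A100-40GB GPUs
    beta: float = 0.1
    delta: float = 0.0
    lam: float = 0.5           # lambda (renamed: `lambda` is a Python keyword)
    gamma: float = 1e-4        # assumed reading of the reported value
    eta: float = 1.0

    def grad_accum_steps(self, per_device_batch: int) -> int:
        """Accumulation steps needed to reach the global batch size.

        `per_device_batch` is an assumed micro-batch size; the table
        does not report the value used in the paper.
        """
        if per_device_batch <= 0:
            raise ValueError("per_device_batch must be positive")
        per_step = self.num_gpus * per_device_batch
        if self.global_batch_size % per_step != 0:
            raise ValueError(
                "global_batch_size must be divisible by "
                "num_gpus * per_device_batch"
            )
        return self.global_batch_size // per_step


cfg = TrainConfig()
# With an assumed micro-batch of 4 per GPU: 64 / (4 * 4) = 4 steps.
print(cfg.grad_accum_steps(per_device_batch=4))
```

With 16 samples per GPU, no accumulation would be needed (`grad_accum_steps(16)` returns 1). Whether that fits in 40 GB of memory depends on the model and sequence length, which the table does not cover.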