Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Authors: Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks. Codes are available at https://github.com/LXDxmu/CcDPO. 1 Introduction ... 5 Experiments 5.1 Experimental Settings and Evaluation Benchmarks 5.2 Main Results 5.3 Ablation Studies |
| Researcher Affiliation | Collaboration | 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China 2 Tencent Youtu Lab 3 Nanjing University |
| Pseudocode | No | The paper describes methods using structured text and mathematical equations (Eq. 1, Eq. 2) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/LXDxmu/CcDPO. |
| Open Datasets | Yes | To support these two-level optimization objectives, we introduce MultiScope-42k, a scalable multi-image preference dataset. The dataset comprises high-quality chosen responses synthesized by splicing together accurate image- and region-level descriptions alongside rejected responses generated through targeted perturbations at both contextual and local detail levels. ... We use LLaVA-23K [61] and COCO [62] as our detailed and brief context caption pool, respectively. ... We utilize MDVP [63] for the region-level caption pool. ... We use the MVC [49] dataset as a region-level visual counterfactual caption pool. |
| Dataset Splits | Yes | Evaluation Benchmarks. We employ seven multi-image benchmarks (MUIRBench [42], MIRB [68], BLINK [69], Mantis-Eval [44], NLVR2 [70], Q-Bench2 [71], and MIBench [72]) to holistically evaluate multi-image reasoning across four key dimensions: co-reference alignment, fine-grained comparison, contextual reasoning, and temporal understanding. Complementing these, eight representative single-image benchmarks assess specific multimodal capabilities: (1) Academic/Scientific Reasoning: MMMU [73], MMStar [74], ScienceQA [75], (2) Diagram Understanding: AI2D [76], (3) Robustness against hallucinations: POPE [77], HallBench [38], (4) General Multimodal Abilities: MMBench [78], (5) Text Recognition: OCRBench [79]. |
| Hardware Specification | Yes | All training is conducted on eight GPUs, each equipped with 90GB of memory. |
| Software Dependencies | No | The paper mentions base MLLMs like Qwen2-VL [33] and LLaVA-OV [32] but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Our model undergoes a three-stage sequential training process to better understand multi-image preferences at both broad (context) and detailed (needle) levels. Stage 1 focuses on context-level alignment, where we fine-tune Qwen2-VL-7B and LLaVA-OV-7B for one epoch with learning rates of 5 × 10⁻⁶ and 5 × 10⁻⁵, respectively, using Eq. 1. Stage 2 applies needle-level language-based DPO using Eq. 1 to improve sensitivity to fine-grained visual cues with the same learning rate of 5 × 10⁻⁵. We conduct Stage 1 and Stage 2 by using LoRA adaptation [67] with rank r = 128 for efficiency. Stage 3 performs vision contrastive DPO via full-parameter tuning for one epoch with a learning rate of 1 × 10⁻⁶ using Eq. 2, strengthening the model's ability to distinguish preferred visual content. Following the setup in [27], we set the temperature parameter β = β1 = β2 = 0.1 and the negative log-likelihood (NLL) loss coefficient γ = 0.1. |
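
The Experiment Setup row can be read as a small configuration plus a loss with two weights. The sketch below restates the quoted hyperparameters as plain Python data and pairs them with the standard DPO objective plus a length-normalized NLL term on the chosen response. This is an assumption made for illustration: Eq. 1 and Eq. 2 are not reproduced in this summary, so the loss here is a stand-in, not the authors' formulation. The names `STAGES`, `softplus`, and `dpo_with_nll` are hypothetical and do not come from the paper or its repository.

```python
import math

# Hyperparameters as quoted in the Experiment Setup row.
# "epochs": None marks a value the quoted text does not state.
STAGES = [
    {"stage": 1, "goal": "context-level alignment", "objective": "Eq. 1",
     "tuning": "LoRA", "lora_rank": 128, "epochs": 1,
     "lr": {"Qwen2-VL-7B": 5e-6, "LLaVA-OV-7B": 5e-5}},
    {"stage": 2, "goal": "needle-level language-based DPO", "objective": "Eq. 1",
     "tuning": "LoRA", "lora_rank": 128, "epochs": None,
     "lr": 5e-5},
    {"stage": 3, "goal": "vision contrastive DPO", "objective": "Eq. 2",
     "tuning": "full-parameter", "lora_rank": None, "epochs": 1,
     "lr": 1e-6},
]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_with_nll(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 chosen_num_tokens, beta=0.1, gamma=0.1):
    """Standard DPO loss plus gamma times an NLL term (assumed form).

    Inputs are summed token log-probabilities of one chosen/rejected pair
    under the policy and the frozen reference model. The NLL term is
    normalized by the chosen response length, which is an assumption.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    preference_loss = softplus(-margin)  # equals -log(sigmoid(margin))
    nll = -policy_chosen_logp / chosen_num_tokens
    return preference_loss + gamma * nll
```

With the defaults β = 0.1 and γ = 0.1 from the table, a policy identical to the reference has zero margin, so `dpo_with_nll(-10.0, -12.0, -10.0, -12.0, 10)` evaluates to log 2 + 0.1, and the loss drops as the policy lowers the rejected response's log-probability relative to the reference.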