Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

Authors: Jiachen Jiang, Jinxin Zhou, Bo Peng, Xia Ning, Zhihui Zhu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.
Researcher Affiliation | Academia | Department of Computer Science and Engineering, The Ohio State University; Translational Data Analytics Institute, The Ohio State University; Department of Biomedical Informatics, The Ohio State University
Pseudocode | Yes | Algorithm 1 Matching Pursuit for Vision Embedding
Open Source Code | Yes | To support future research, we publicly release both the annotation pipeline and the resulting dataset.
Open Datasets | Yes | Applying this pipeline to the 558K LLaVA pretraining dataset, we construct the Patch-Aligned Dataset (PAD), which provides extensive and diverse patch-level annotations. To support future research, we publicly release both the annotation pipeline and the resulting dataset. ... We evaluated these across 100 selected images from the COCO2017 dataset [39]. ... the RefCOCO [40], RefCOCO+ [41], and RefCOCOg [41]. ... LLaVA-Instruct [1], TextVQA [48], GQA [49], OCR-VQA [50], and Visual Genome [51].
Dataset Splits | Yes | For a fair comparison, we follow LLaVA-1.5 [1]'s architecture, training setup, and datasets. ... We evaluate these across 100 selected images from the COCO2017 dataset [39]. ... we conducted a thorough ablation study using the coco-val 2017 [39] dataset, which provides ground truth bounding boxes. ... For pretraining dataset, utilizing our automated annotation pipeline, we annotate the 558K subset of the LAION-CC-SBU dataset, which is used as the pretraining dataset of LLaVA.
Hardware Specification | Yes | Pretraining requires approximately 8 hours using 8 A5000 GPUs (24G), while visual instruction tuning takes about 10 hours for LLaVA-v1.5-7B on 8x H100 (80G).
Software Dependencies | No | The paper mentions specific models like CLIP-ViT-L@336px [33] and Vicuna-1.5-7B [47], and the AdamW optimizer, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch versions).
Experiment Setup | Yes | The parameter β follows a linear schedule, increasing from 0 to 5. ... we optimize all models for 1 epoch using the AdamW optimizer with a cosine learning schedule. The learning rates are set to 1e-3 for pretraining and 2e-5 for instruction tuning.
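
The two schedules quoted in the Experiment Setup entry can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `linear_beta` and `cosine_lr` are hypothetical, and it assumes that β is updated once per optimizer step, that the cosine schedule decays to zero, and that there is no warmup phase, none of which is stated in the quoted text.

```python
import math


def _progress(step: int, total_steps: int) -> float:
    """Fraction of training completed, clamped to [0, 1]."""
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def linear_beta(step: int, total_steps: int,
                beta_start: float = 0.0, beta_end: float = 5.0) -> float:
    """Linear schedule for beta, rising from 0 to 5 as quoted above.

    Assumption: beta is updated per optimizer step.
    """
    frac = _progress(step, total_steps)
    return beta_start + frac * (beta_end - beta_start)


def cosine_lr(step: int, total_steps: int,
              peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine learning-rate decay from peak_lr down to min_lr.

    Assumptions: no warmup, and min_lr = 0 by default.
    """
    if total_steps <= 1:
        return peak_lr
    frac = _progress(step, total_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * frac))


if __name__ == "__main__":
    total = 101  # hypothetical number of optimizer steps in the single epoch
    for step in (0, 50, 100):
        print(step,
              linear_beta(step, total),
              cosine_lr(step, total, peak_lr=1e-3),   # pretraining rate
              cosine_lr(step, total, peak_lr=2e-5))   # instruction-tuning rate
```

In an actual PyTorch training loop the same learning-rate shape would more commonly come from a built-in scheduler attached to the AdamW optimizer; the plain functions here only make the shape of each schedule explicit.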