Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
Authors: Jiachen Jiang, Jinxin Zhou, Bo Peng, Xia Ning, Zhihui Zhu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models. |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering, The Ohio State University; Translational Data Analytics Institute, The Ohio State University; Department of Biomedical Informatics, The Ohio State University; EMAIL |
| Pseudocode | Yes | Algorithm 1 Matching Pursuit for Vision Embedding |
| Open Source Code | Yes | To support future research, we publicly release both the annotation pipeline and the resulting dataset. |
| Open Datasets | Yes | Applying this pipeline to the 558K LLaVA pretraining dataset, we construct the Patch-Aligned Dataset (PAD), which provides extensive and diverse patch-level annotations. To support future research, we publicly release both the annotation pipeline and the resulting dataset. ... We evaluated these across 100 selected images from the COCO2017 dataset [39]. ... the RefCOCO [40], RefCOCO+ [41], and RefCOCOg [41]. ... LLaVA-Instruct [1], TextVQA [48], GQA [49], OCR-VQA [50], and Visual Genome [51]. |
| Dataset Splits | Yes | For a fair comparison, we follow LLaVA-1.5 [1]'s architecture, training setup, and datasets. ... We evaluate these across 100 selected images from the COCO2017 dataset [39]. ... we conducted a thorough ablation study using the coco-val 2017 [39] dataset, which provides ground truth bounding boxes. ... For pretraining dataset, utilizing our automated annotation pipeline, we annotate the 558K subset of the LAION-CC-SBU dataset, which is used as the pretraining dataset of LLaVA. |
| Hardware Specification | Yes | Pretraining requires approximately 8 hours using 8 A5000 GPUs (24G), while visual instruction tuning takes about 10 hours for LLaVA-v1.5-7B on 8x H100 (80G). |
| Software Dependencies | No | The paper mentions specific models like CLIP-ViT-L@336px [33] and Vicuna-1.5-7B[47], and the AdamW optimizer, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | The parameter β follows a linear schedule, increasing from 0 to 5. ... we optimize all models for 1 epoch using the AdamW optimizer with a cosine learning schedule. The learning rates are set to 1e-3 for pretraining and 2e-5 for instruction tuning. |
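
The Experiment Setup row describes two schedules: a weight β that rises linearly from 0 to 5, and a cosine learning-rate schedule with peak rates of 1e-3 (pretraining) and 2e-5 (instruction tuning). The following is a minimal sketch of what such schedules might look like, not the authors' implementation. The function names, the per-step granularity, the decay to zero, and the absence of warmup are all assumptions; the quoted text does not specify them.

```python
import math


def linear_beta(step, total_steps, beta_max=5.0):
    """Hypothetical helper: beta rises linearly from 0 to beta_max.

    Assumes the schedule is applied per optimizer step and reaches
    beta_max on the final step; the source only says "linear, 0 to 5".
    """
    if total_steps <= 1:
        return beta_max
    clamped = min(max(step, 0), total_steps - 1)
    return beta_max * clamped / (total_steps - 1)


def cosine_lr(step, total_steps, base_lr):
    """Hypothetical helper: cosine decay from base_lr down to 0.

    Warmup is omitted and the floor of 0 is an assumption.
    """
    frac = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * frac))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for s in (0, 250, 500, 750, 999):
        print(s, round(linear_beta(s, total), 3), cosine_lr(s, total, 1e-3))
```

In an actual training run these values would typically be supplied by the framework's scheduler (the row names AdamW as the optimizer); the plain-Python form here is only meant to make the shape of the two curves concrete.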