Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

Authors: Jiachen Jiang, Jinxin Zhou, Bo Peng, Xia Ning, Zhihui Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.
Researcher Affiliation	Academia	Department of Computer Science and Engineering, The Ohio State University; Translational Data Analytics Institute, The Ohio State University; Department of Biomedical Informatics, The Ohio State University; EMAIL
Pseudocode	Yes	Algorithm 1 Matching Pursuit for Vision Embedding
Open Source Code	Yes	To support future research, we publicly release both the annotation pipeline and the resulting dataset.
Open Datasets	Yes	Applying this pipeline to the 558K LLaVA pretraining dataset, we construct the Patch-Aligned Dataset (PAD), which provides extensive and diverse patch-level annotations. To support future research, we publicly release both the annotation pipeline and the resulting dataset. ... We evaluated these across 100 selected images from the COCO2017 dataset [39]. ... the RefCOCO [40], RefCOCO+ [41], and RefCOCOg [41]. ... LLaVA-Instruct [1], TextVQA [48], GQA [49], OCR-VQA [50], and Visual Genome [51].
Dataset Splits	Yes	For a fair comparison, we follow LLaVA-1.5 [1]'s architecture, training setup, and datasets. ... We evaluate these across 100 selected images from the COCO2017 dataset [39]. ... we conducted a thorough ablation study using the coco-val 2017 [39] dataset, which provides ground truth bounding boxes. ... For pretraining dataset, utilizing our automated annotation pipeline, we annotate the 558K subset of the LAION-CC-SBU dataset, which is used as the pretraining dataset of LLaVA.
Hardware Specification	Yes	Pretraining requires approximately 8 hours using 8 A5000 GPUs (24G), while visual instruction tuning takes about 10 hours for LLaVA-v1.5-7B on 8x H100 (80G).
Software Dependencies	No	The paper mentions specific models like CLIP-ViT-L@336px [33] and Vicuna-1.5-7B[47], and the AdamW optimizer, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch versions).
Experiment Setup	Yes	The parameter β follows a linear schedule, increasing from 0 to 5. ... we optimize all models for 1 epoch using the AdamW optimizer with a cosine learning schedule. The learning rates are set to 1e-3 for pretraining and 2e-5 for instruction tuning.
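
Illustrative sketch (not taken from the paper): the Experiment Setup row quotes a linear schedule for β from 0 to 5 and a cosine learning-rate schedule with base rates of 1e-3 (pretraining) and 2e-5 (instruction tuning). The Python below shows one way such schedules could be computed. The function names `beta_linear` and `cosine_lr` are hypothetical, and the per-step granularity, the absence of warmup, and the decay to a minimum rate of 0 are assumptions; the quoted text does not specify them.

```python
import math


def _progress(step: int, total_steps: int) -> float:
    """Fraction of training completed, clamped to [0, 1]."""
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def beta_linear(step: int, total_steps: int, beta_max: float = 5.0) -> float:
    """Linearly increase beta from 0 to beta_max (5 in the quoted setup)."""
    return beta_max * _progress(step, total_steps)


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr (assumed: no warmup, min_lr = 0)."""
    frac = _progress(step, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * frac))


if __name__ == "__main__":
    total = 101  # hypothetical number of optimizer steps in the single epoch
    for step in (0, 50, 100):
        print(
            step,
            beta_linear(step, total),
            cosine_lr(step, total, base_lr=1e-3),  # pretraining rate
            cosine_lr(step, total, base_lr=2e-5),  # instruction-tuning rate
        )
```

Under these assumptions, β starts at 0 and reaches 5 on the final step, while each learning rate starts at its base value and decays toward 0. The authors' released code should be consulted for the schedule actually used.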