Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Visual Instruction Bottleneck Tuning

Authors: Changdae Oh, Jiatong Li, Shawn Im, Sharon Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical validation of multiple MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.
Researcher Affiliation	Academia	Changdae Oh, Jiatong Li, Shawn Im, Sharon Li; Department of Computer Sciences, University of Wisconsin-Madison
Pseudocode	Yes	Figure 9: PyTorch-style pseudo code for the forward pass and training objective of Vittle
Open Source Code	Yes	Code: https://github.com/deeplearning-wisc/vittle
Open Datasets	Yes	For fair comparison, all models are trained on the LLaVA-pretrain-558k and LLaVA-mix-665k datasets, consisting of a mixture of LAION [93], CC [94], SBU [95] datasets with BLIP captions [96] and a mixture of LLaVA-instruct-158K and academic-task-oriented (V)QA datasets, respectively. ... We adopt LLaVA-Bench COCO (LB-COCO; [38]) as a typical in-distribution (ID) open-ended QA dataset... A representative benchmark for this is the POPE dataset [77]... We consider four representative datasets: ScienceQA [78], MMMU [79], MME [5], and MMStar [80]
Dataset Splits	Yes	We follow the standard two-stage training of LLaVA [38]... For all of our open-ended QA evaluations, we used the same system prompt template provided by LLaVA authors, and we also adopted the MS-COCO annotation-based GPT-4 response and the gpt_answer released by LLaVA-NeXT authors as reference answers for LB-COCO variants and LB-Wilder, respectively. For LB-Wild and WV-Bench, we generated reference answers with GPT-4o.
Hardware Specification	Yes	All training runs are conducted with eight A100-80GB GPUs with DeepSpeed ZeRO library. The shortest run takes roughly 11 hours, whereas the longest run takes about 14 hours.
Software Dependencies	No	All training runs are conducted with eight A100-80GB GPUs with DeepSpeed ZeRO library. The paper mentions the 'DeepSpeed ZeRO library' but does not specify a version number for it or any other software components.
Experiment Setup	Yes	Table 7: Hyperparameter list of Vittle training. We adopt exactly the same configurations with LLaVA-v1.5 [70] for Stage 1 and 2. ... Table 8: Hyperparameter list of Prism Vittle training. We adopt exactly the same configurations with Prism-DINOSigLIP-Controlled-7B [73] single stage training.