Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Learning to Instruct for Visual Instruction Tuning

Authors: Zhihan Zhou, Feng Hong, JIAAN LUO, Yushi Ye, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments across a range of 16 tasks, comparing L2T and VIT on different models. ... Experimental results demonstrate the effectiveness of L2T across 16 multimodal tasks, highlighting its superior performance on OCR and image captioning tasks by placing greater emphasis on visual content.
Researcher Affiliation | Collaboration | (1) Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; (2) Microsoft Research Asia; (3) Hong Kong Baptist University; (4) School of Artificial Intelligence, Shanghai Jiao Tong University; (5) Institute of Artificial Intelligence for Medicine, Shanghai Jiao Tong University School of Medicine
Pseudocode | No | The paper describes its methodology and loss functions mathematically and in prose, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | GitHub code: https://github.com/Feng-Hong/L2T.
Open Datasets | Yes | We adopt the LLaVA 1.5 framework, utilizing the LLaVA-pretrain-558k data [Liu et al., 2024a] for all pretraining phases. (2) Finetuning Stage: For instruction tuning, we use the LLaVA-mix-665k data [Liu et al., 2024a] on TinyLLaVA and LLaVA 1.5. For LLaVA-NeXT, we use the LLaVA-NeXT-Data [Liu et al., 2024b], an expansion of LLaVA-mix-665k with diverse instruction data. ... (1) General Visual Question Answering, assessed through VQAv2 [Goyal et al., 2017], GQA [Hudson and Manning, 2019], ScienceQA [Lu et al., 2022], and VizWiz [Gurari et al., 2018]; (2) Comprehensive Multimodal Benchmarks, evaluated using MME [Fu et al., 2023], MMMU [Yue et al., 2024], and MMStar [Chen et al., 2024b]; (3) Chart, Document, and OCR Understanding, evaluated using ChartQA [Masry et al., 2022], TextVQA [Singh et al., 2019], DocVQA [Mathew et al., 2021], and OCRBench [Liu et al., 2023b]; and (4) Image Captioning, assessed through COCO2017 [Lin et al., 2014], Flickr30k [Young et al., 2014], NoCaps [Agrawal et al., 2019], RefCOCO [Kazemzadeh et al., 2014], and TextCaps [Sidorov et al., 2020].
Dataset Splits | No | The paper references various datasets used for pretraining, finetuning, and evaluation. While it mentions 'pretraining phase' and 'finetuning phase' datasets (e.g., LLaVA-pretrain-558k, LLaVA-mix-665k), and evaluates on test data for benchmarks (e.g., 'DocVQA test data' in Figure 5), it does not explicitly provide the specific training/validation/test splits (percentages, counts, or explicit standard split references) used for its own experiments beyond generally stating the datasets used for training and evaluation.
Hardware Specification | Yes | We train all models on NVIDIA A100 GPUs. ... The experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions models like Qwen-2-0.5B and specific optimizers like AdamW, but it does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch 1.x, CUDA 11.x) that would be needed for exact reproduction.
Experiment Setup | Yes | Implementation Details. We train all models on NVIDIA A100 GPUs, strictly following the training recipes of TinyLLaVA, LLaVA 1.5 and LLaVA-NeXT. See Appendix B.1 for more details. ... Table 10: Training Hyperparameters for TinyLLaVA, LLaVA 1.5 and LLaVA-NeXT. Hyperparameter: Learning rate (LR), LR warmup ratio, Batch size, LR schedule, Epoch, Optimizer, Trainable parameters. (e.g., LR 1e-3, Batch size 256 for pretrain; LR 2e-5, Batch size 128 for finetune for TinyLLaVA & LLaVA 1.5).
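
The quoted hyperparameters can be turned into a small configuration sketch for anyone attempting a rerun. The Python below is illustrative only: the class and constant names (StageConfig, PRETRAIN, FINETUNE, NUM_GPUS) are hypothetical and do not come from the paper or its repository. It assumes that the batch sizes quoted from Table 10 are global (summed across devices) and that no gradient accumulation is used unless specified; the summary above does not confirm either point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""

    learning_rate: float
    global_batch_size: int  # ASSUMPTION: the quoted batch size is global

    def per_device_batch_size(self, num_gpus: int, grad_accum_steps: int = 1) -> int:
        """Split the global batch evenly across devices and accumulation steps."""
        denom = num_gpus * grad_accum_steps
        if denom <= 0:
            raise ValueError("num_gpus and grad_accum_steps must be positive")
        if self.global_batch_size % denom != 0:
            raise ValueError(
                f"global batch {self.global_batch_size} is not divisible by {denom}"
            )
        return self.global_batch_size // denom


# Values quoted in the Experiment Setup row for TinyLLaVA and LLaVA 1.5.
PRETRAIN = StageConfig(learning_rate=1e-3, global_batch_size=256)
FINETUNE = StageConfig(learning_rate=2e-5, global_batch_size=128)

# Device count quoted in the Hardware Specification row.
NUM_GPUS = 8

if __name__ == "__main__":
    for name, cfg in (("pretrain", PRETRAIN), ("finetune", FINETUNE)):
        per_device = cfg.per_device_batch_size(NUM_GPUS)
        print(f"{name}: lr={cfg.learning_rate}, per-device batch={per_device}")
```

Under these assumptions, 8 GPUs would each process 32 samples per step during pretraining and 16 during finetuning. Raising grad_accum_steps lowers the per-device batch proportionally, which may help if memory is tight.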