Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Data Selection Matters: Towards Robust Instruction Tuning of Large Multimodal Models

Authors: Xu Yang, Chen Liu, Ying Wei

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that ARDS substantially boosts both robustness and data efficiency for visual instruction tuning. ... Extensive experiments across eleven evaluation benchmarks validate the effectiveness of our curated robust training mixture.
Researcher Affiliation	Academia	1 City University of Hong Kong; 2 Zhejiang University
Pseudocode	Yes	The overall procedure is summarized in Algorithm 1. ... The complete algorithm is shown in Algorithm 1 of Appendix I. ... In Algorithm 1, we outline our robustness-aware data selection procedure to curate the robust training mixture.
Open Source Code	Yes	Our code and selected datasets that have been demonstrated transferable across models are available at https://github.com/xyang583/ARDS.
Open Datasets	Yes	We use the original training corpus LLaVA-665K [63] for our robust training mixture curation without introducing any external data. ... LLaVA-1.5 dataset, which contains 665K multimodal conversations [63] collected from mixed sources, such as LLaVA-158K [64], VQAv2 [34], OKVQA [73], RefCOCO [46], and TextCaps [97].
Dataset Splits	Yes	We evaluate visual instruction tuning on the mixture with the size of 30% training data selected by ARDS and by several baselines. ... We choose 30% as the default training budget in our main experiments, as it offers the best trade-off between data efficiency and both clean and robust performance.
Hardware Specification	Yes	All experiments are conducted on 8 Nvidia RTX A6000 GPUs.
Software Dependencies	No	The paper mentions software components like 'DeepSpeed stage 3', 'AdamW' optimizer, and 'Low-Rank Adaptation (LoRA) [41]' as part of the experimental setup, but does not provide specific version numbers for core software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup	Yes	A complete list of hyperparameters is provided in Table 8. ... Table 8: Hyperparameters of visual instruction tuning. (a) LLaVA-1.5 (7B), finetune: LoRA rank 128; batch size 128; lr 2e-4; lr schedule cosine decay; lr warmup ratio 0.03; weight decay 0; epoch 1; optimizer AdamW; DeepSpeed stage 3. (b) LLaVA-1.5 (13B), finetune: LoRA rank 128; batch size 128; lr 2e-5; lr schedule cosine decay; lr warmup ratio 0.03; weight decay 0; epoch 1; optimizer AdamW; DeepSpeed stage 3.
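
For readers who want to carry the Table 8 values into their own tooling, the hyperparameters quoted in the Experiment Setup row could be recorded roughly as follows. This is a minimal sketch, not the authors' configuration format: the class name `FinetuneConfig`, its field names, and the two helper methods are hypothetical, and the step arithmetic assumes that the batch size of 128 is the global batch size and that warmup is counted in optimizer steps. The sample count passed to the helpers is supplied by the caller; the paper excerpt gives only "665K" and "30%", not an exact figure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the Table 8 values quoted above."""

    model: str
    learning_rate: float
    lora_rank: int = 128
    batch_size: int = 128          # assumed to be the global batch size
    lr_schedule: str = "cosine"
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    epochs: int = 1
    optimizer: str = "AdamW"
    deepspeed_stage: int = 3

    def total_steps(self, num_samples: int) -> int:
        # One optimizer step per (possibly partial) batch, for every epoch.
        return math.ceil(num_samples / self.batch_size) * self.epochs

    def warmup_steps(self, num_samples: int) -> int:
        # Warmup length as a fraction of all steps, rounded up.
        return math.ceil(self.warmup_ratio * self.total_steps(num_samples))


# The two settings differ only in model size and learning rate.
LLAVA_7B = FinetuneConfig(model="LLaVA-1.5 (7B)", learning_rate=2e-4)
LLAVA_13B = FinetuneConfig(model="LLaVA-1.5 (13B)", learning_rate=2e-5)
```

Calling `LLAVA_7B.total_steps(n)` with the actual size of the selected training mixture gives the number of optimizer steps under these assumptions; the authors' released code remains the authoritative source for the real training setup.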