Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Data Selection Matters: Towards Robust Instruction Tuning of Large Multimodal Models
Authors: Xu Yang, Chen Liu, Ying Wei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that ARDS substantially boosts both robustness and data efficiency for visual instruction tuning. ... Extensive experiments across eleven evaluation benchmarks validate the effectiveness of our curated robust training mixture. |
| Researcher Affiliation | Academia | 1 City University of Hong Kong 2 Zhejiang University |
| Pseudocode | Yes | The overall procedure is summarized in Algorithm 1. ... The complete algorithm is shown in Algorithm 1 of Appendix I. ... In Algorithm 1, we outline our robustness-aware data selection procedure to curate the robust training mixture. |
| Open Source Code | Yes | Our code and selected datasets that have been demonstrated transferable across models are available at https://github.com/xyang583/ARDS. |
| Open Datasets | Yes | We use the original training corpus LLaVA-665K [63] for our robust training mixture curation without introducing any external data. ... LLaVA-1.5 dataset, which contains 665K multimodal conversations [63] collected from mixed sources, such as LLaVA-158K [64], VQAv2 [34], OKVQA [73], RefCOCO [46], and TextCaps [97]. |
| Dataset Splits | Yes | We evaluate visual instruction tuning on the mixture with the size of 30% training data selected by ARDS and by several baselines. ... We choose 30% as the default training budget in our main experiments, as it offers the best trade-off between data efficiency and both clean and robust performance. |
| Hardware Specification | Yes | All experiments are conducted on 8 Nvidia RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'DeepSpeed stage 3', 'AdamW' optimizer, and 'Low-Rank Adaptation (LoRA) [41]' as part of the experimental setup, but does not provide specific version numbers for core software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | A complete list of hyperparameters is provided in Table 8. ... Table 8: Hyperparameters of visual instruction tuning. (a) LLaVA-1.5 (7B), Finetune: LoRA rank 128; batch size 128; lr 2e-4; lr schedule cosine decay; lr warmup ratio 0.03; weight decay 0; epoch 1; optimizer AdamW; DeepSpeed stage 3. (b) LLaVA-1.5 (13B), Finetune: LoRA rank 128; batch size 128; lr 2e-5; lr schedule cosine decay; lr warmup ratio 0.03; weight decay 0; epoch 1; optimizer AdamW; DeepSpeed stage 3. |
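
The Table 8 values quoted above can be collected into a small configuration to see roughly what the 30% budget means for a training schedule. This is only a sketch: the key names and the `schedule_summary` helper are hypothetical and are not taken from the ARDS repository. It also assumes that "665K" means exactly 665,000 samples, that the batch size of 128 is the global batch size, and that step counts are rounded up.

```python
import math

# Values transcribed from the Table 8 excerpt; key names are illustrative.
TABLE_8 = {
    "LLaVA-1.5 (7B)": {
        "lora_rank": 128, "batch_size": 128, "lr": 2e-4,
        "lr_schedule": "cosine", "warmup_ratio": 0.03,
        "weight_decay": 0.0, "epochs": 1, "optimizer": "AdamW",
    },
    "LLaVA-1.5 (13B)": {
        "lora_rank": 128, "batch_size": 128, "lr": 2e-5,
        "lr_schedule": "cosine", "warmup_ratio": 0.03,
        "weight_decay": 0.0, "epochs": 1, "optimizer": "AdamW",
    },
}

CORPUS_SIZE = 665_000   # assumption: "665K" taken as exactly 665,000
BUDGET_PERCENT = 30     # default training budget reported in the paper


def schedule_summary(cfg, corpus_size=CORPUS_SIZE, budget_percent=BUDGET_PERCENT):
    """Estimate selected samples, optimizer steps, and warmup steps."""
    samples = corpus_size * budget_percent // 100
    # Assumption: batch_size is the global batch size; partial batches count.
    steps = math.ceil(samples * cfg["epochs"] / cfg["batch_size"])
    warmup_steps = math.ceil(steps * cfg["warmup_ratio"])
    return {"samples": samples, "steps": steps, "warmup_steps": warmup_steps}


if __name__ == "__main__":
    for name, cfg in TABLE_8.items():
        print(name, cfg["lr"], schedule_summary(cfg))
```

Under these assumptions both model sizes share the same step counts, because the two configurations differ only in learning rate (2e-4 for 7B, 2e-5 for 13B).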