Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learn from Downstream and Be Yourself in Multimodal Large Language Models Fine-Tuning

Authors: Wenke Huang, Jian Liang, Zekun Shi, Didi Zhu, Guancheng Wan, He Li, Bo Du, Dacheng Tao, Mang Ye

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM Fine-Tuning.
Researcher Affiliation Academia 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China 2Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 3Nanyang Technological University, Singapore. Correspondence to: Mang Ye <EMAIL>, Bo Du <EMAIL>.
Pseudocode Yes Algorithm 1 SPIDER. Input: fine-tuning epochs E, overall MLLM network θ, trainable parameter module w, frozen pre-trained parameter weights. Output: the optimized selected MLLM model w.
Open Source Code No The paper mentions following official codebases (LLaVA and VILA) for fine-tuning procedures but does not explicitly state that their own methodology (SPIDER) has its source code released. The links provided are for the foundation models used, not the specific implementation of SPIDER.
Open Datasets Yes For fine-tuning tasks, we consider four downstream datasets: Flickr30k (Young et al., 2014), COCO-Caption (Lin et al., 2014), IconQA (Lu et al., 2021), Science QA (Lu et al., 2022)1, which respectively associate with image captioning and visual reasoning views. To be precise, OKVQA, TextVQA, GQA, and OCRVQA are explicitly mentioned as training datasets in the pre-training stage, making them appropriate benchmarks to evaluate multimodal large language model (MLLM) generalization across diverse tasks. 1https://huggingface.co/datasets/BAAI/DataOptim
Dataset Splits Yes We follow (Zhou et al., 2024) resource setting and randomly sample 10k samples from the training set of each dataset.
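The 10k-per-dataset sampling described above can be sketched as follows. This is an illustrative sketch only; the paper does not publish its sampling code, and the function name, seed, and dataset size here are assumptions.

```python
import random

def sample_subset(dataset_size: int, k: int = 10_000, seed: int = 0) -> list[int]:
    # Randomly sample k example indices from a training set of the given
    # size, mirroring the 10k-sample resource setting quoted above.
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    return sorted(rng.sample(range(dataset_size), k))

# e.g. pick 10k indices from a hypothetical ~31k-image training split
subset = sample_subset(31_000)
```

A fixed seed is used here so the same subset can be regenerated across runs, which is the usual practice when only a fraction of each training set is used.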
Hardware Specification Yes All experiments are conducted on 8 NVIDIA 4090 GPUs, each with 24GB memory.
Software Dependencies No The paper provides details about learning rates and model architectures but does not specify software dependencies (e.g., libraries, frameworks) with version numbers. It refers to official codebases for LLaVA and VILA but doesn't list the software stack with versions.
Experiment Setup Yes The learning rate lr in LLaVA (Liu et al., 2023b) is 2e-4 for the LLM and 2e-5 for the visual projector. For VILA (Lin et al., 2023a), we uniformly set the learning rate to 1e-4. The number of training epochs is E = 5, and the training batch size B is set to 16.
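The reported hyperparameters can be collected into a single configuration table. The key names below are illustrative, not taken from the paper's codebase; only the numeric values come from the quoted setup.

```python
# Hyperparameters quoted in the experiment setup above.
# Structure and key names are assumptions for illustration.
CONFIGS = {
    "llava": {
        "lr_llm": 2e-4,        # LLM learning rate for LLaVA
        "lr_projector": 2e-5,  # visual projector learning rate for LLaVA
        "epochs": 5,           # training epochs E
        "batch_size": 16,      # training batch size B
    },
    "vila": {
        "lr_llm": 1e-4,        # VILA uses a uniform learning rate of 1e-4
        "lr_projector": 1e-4,
        "epochs": 5,
        "batch_size": 16,
    },
}
```

Keeping the two model configurations side by side makes the asymmetry explicit: LLaVA uses separate rates for the LLM and projector, while VILA trains both with one rate.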