Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learn from Downstream and Be Yourself in Multimodal Large Language Models Fine-Tuning
Authors: Wenke Huang, Jian Liang, Zekun Shi, Didi Zhu, Guancheng Wan, He Li, Bo Du, Dacheng Tao, Mang Ye
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM Fine-Tuning. |
| Researcher Affiliation | Academia | 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China 2Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 3Nanyang Technological University, Singapore. Correspondence to: Mang Ye <EMAIL>, Bo Du <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SPIDER. Input: fine-tuning epochs E, overall MLLM network θ, trainable parameter module w, frozen pre-trained parameter weight w. Output: the optimized selected MLLM model w. |
| Open Source Code | No | The paper mentions following official codebases (LLaVA and VILA) for fine-tuning procedures but does not explicitly state that their own methodology (SPIDER) has its source code released. The links provided are for the foundation models used, not the specific implementation of SPIDER. |
| Open Datasets | Yes | For fine-tuning tasks, we consider four downstream datasets: Flickr30k (Young et al., 2014), COCO-Caption (Lin et al., 2014), IconQA (Lu et al., 2021), and ScienceQA (Lu et al., 2022), which are respectively associated with image captioning and visual reasoning. To be precise, OKVQA, TextVQA, GQA, and OCRVQA are explicitly mentioned as training datasets in the pre-training stage, making them appropriate benchmarks for evaluating multimodal large language model (MLLM) generalization across diverse tasks. (Footnote: https://huggingface.co/datasets/BAAI/DataOptim) |
| Dataset Splits | Yes | We follow (Zhou et al., 2024) resource setting and randomly sample 10k samples from the training set of each dataset. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA 4090 GPUs, each with 24GB memory. |
| Software Dependencies | No | The paper provides details about learning rates and model architectures but does not specify software dependencies (e.g., libraries, frameworks) with version numbers. It refers to official codebases for LLaVA and VILA but doesn't list the software stack with versions. |
| Experiment Setup | Yes | The learning rate lr in LLaVA (Liu et al., 2023b) is 2e-4 for the LLM and 2e-5 for the visual projector. For VILA (Lin et al., 2023a), we uniformly set the learning rate to 1e-4. The number of training epochs is E = 5. The training batch size B is set to 16. |
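The per-module learning rates in the Experiment Setup row (LLaVA: 2e-4 for the LLM, 2e-5 for the visual projector; 5 epochs, batch size 16) can be captured in a small configuration sketch. This is a minimal illustration, not the paper's released code; the parameter-name prefix `mm_projector` is an assumption about how projector weights might be named, not something stated in the paper.

```python
# Hedged sketch of the reported LLaVA fine-tuning hyperparameters.
# Values come from the Experiment Setup row; module-name prefixes are
# hypothetical and would need to match the actual model's parameter names.

CONFIG = {
    "epochs": 5,          # E = 5
    "batch_size": 16,     # B = 16
    "lr": {
        "llm": 2e-4,               # learning rate for the LLM backbone
        "visual_projector": 2e-5,  # learning rate for the visual projector
    },
}

def lr_for(param_name: str) -> float:
    """Select a learning rate for a parameter based on its module prefix
    (assumed prefix 'mm_projector' for the visual projector)."""
    if param_name.startswith("mm_projector"):
        return CONFIG["lr"]["visual_projector"]
    return CONFIG["lr"]["llm"]
```

In a framework like PyTorch, the same split would typically be expressed as two optimizer parameter groups, one per learning rate.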