Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Authors: Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework. 4 Experiments 4.2 Experimental Results on Benchmarks 5.1 Ablation Studies |
| Researcher Affiliation | Academia | 1 EPIC Lab, Shanghai Jiao Tong University; 2 Shanghai AI Laboratory; 3 Duke University; 4 The University of Hong Kong; 5 Peking University; 6 University of Chicago; 7 Sun Yat-sen University. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methodologies and theoretical intuitions in text and mathematical formulas within Section 3, but it does not contain any explicitly labeled pseudocode, algorithm blocks, or structured code-like procedures. |
| Open Source Code | No | Code: https://github.com/ZichenWen1/EPIC 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: In accordance with the requirements of the supporting organization, the code will be released after the review process is completed. We are committed to providing sufficient instructions to ensure reproducibility at that time. |
| Open Datasets | Yes | We implement EPIC based on LLaVA [41, 40] without introducing any modifications to the model architecture. Specifically, we adopt CLIP ViT-L/14 [49] as our vision encoder, utilizing its officially pretrained projector, and Vicuna-v1.5 [12] as our LLM. EPIC only requires performing the second stage training, which involves visual instruction tuning on the LLaVA-665K instruction fine-tuning dataset. Evaluation Benchmarks. We evaluate our model across 10 representative visual understanding benchmarks. Further details about benchmarks can be found in the Appendix D.3. MME [21] is a comprehensive benchmark for evaluating the performance of MLLMs in multimodal tasks. MMBench [45] employs a dual approach: it provides an extensive dataset that broadens the range and variety of evaluation questions, and introduces the innovative CircularEval strategy, which uses ChatGPT to convert free-form predictions into structured choices. ScienceQA [47] is a multi-modal benchmark aimed at assessing and diagnosing AI systems' multi-hop reasoning and interpretability in the science domain. GQA [27] is a dataset designed for advanced visual reasoning in real-world scenarios. POPE [36] is an evaluation method for examining object hallucination in MLLMs. VQA V2 [22] evaluates the model's visual perception capabilities through open-ended questions. TextVQA [51] focuses on the comprehensive integration of diverse text information within images. OCRBench [46] is a comprehensive benchmark for evaluating the OCR capabilities of multi-modal language models. |
| Dataset Splits | Yes | The training process, conducted on 8 A100 GPUs, takes approximately 12 hours. Furthermore, we faithfully reproduced the entire LLaVA training process following the official LLaVA-v1.5 training guidelines. All other token compression baselines are either trained following the settings provided in the original papers or evaluated using publicly available model checkpoints. For the ablation study in Section 5.1, the experiment without the distillation loss was trained using the same LLaVA-665K SFT data, with all other training parameters and procedures kept identical to those of EPIC. |
| Hardware Specification | Yes | Table 2: Inference efficiency analysis of EPIC. denotes the reduction ratio. All experiments are on POPE (8,910 samples) using an A100 GPU. The training process, conducted on 8 A100 GPUs, takes approximately 12 hours. |
| Software Dependencies | No | We implement EPIC based on LLaVA [41, 40] without introducing any modifications to the model architecture. Specifically, we adopt CLIP ViT-L/14 [49] as our vision encoder, utilizing its officially pretrained projector, and Vicuna-v1.5 [12] as our LLM. Table 7: Detailed hyperparameter settings (Stage 2). Batch size: 128; learning rate: 2e-5; learning schedule: cosine decay; warmup ratio: 0.03; weight decay: 0; epoch: 1; optimizer: AdamW; DeepSpeed stage: 3; max token: 2048. |
| Experiment Setup | Yes | For a fair comparison, our framework not only adheres to the same model architecture and instruction-tuning data as vanilla LLaVA but also maintains identical hyperparameter settings. Moreover, since we have not modified the model architecture, we can directly use the pre-trained projector unlike MQT-LLaVA, QT-LLaVA, and LLaVA-Mini, which require mandatory Stage 1 pre-training. Table 7: Detailed hyperparameter settings (Stage 2). Batch size: 128; learning rate: 2e-5; learning schedule: cosine decay; warmup ratio: 0.03; weight decay: 0; epoch: 1; optimizer: AdamW; DeepSpeed stage: 3; max token: 2048. |
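
For readers who want to sanity-check the Stage 2 settings quoted from Table 7, the sketch below restates them as a small Python configuration and derives a few quantities that follow from them. It is an illustration only, not the authors' training code. The class and function names are hypothetical, the sample count of 665,000 is a nominal figure read off the dataset name "LLaVA-665K" rather than an exact count, gradient accumulation of 1 is assumed, and the linear-warmup-plus-cosine formula is one common formulation that may differ in detail from the scheduler actually used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage2Config:
    """Stage 2 hyperparameters as quoted from Table 7 (field names are hypothetical)."""
    global_batch_size: int = 128
    learning_rate: float = 2e-5
    lr_schedule: str = "cosine"
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    epochs: int = 1
    optimizer: str = "AdamW"
    deepspeed_stage: int = 3
    max_tokens: int = 2048


def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    """Optimizer steps in one epoch, counting a final partial batch."""
    return math.ceil(num_samples / global_batch_size)


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio (rounded up)."""
    return math.ceil(total_steps * warmup_ratio)


def per_device_batch_size(global_batch_size: int, num_gpus: int,
                          grad_accum_steps: int = 1) -> int:
    """Split the global batch evenly across GPUs and accumulation steps."""
    per_device, remainder = divmod(global_batch_size, num_gpus * grad_accum_steps)
    if remainder:
        raise ValueError("global batch size is not evenly divisible")
    return per_device


def lr_at_step(step: int, total_steps: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed formulation)."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


cfg = Stage2Config()
NOMINAL_SAMPLES = 665_000  # assumption: nominal size taken from the name "LLaVA-665K"
NUM_GPUS = 8               # from the report: training on 8 A100 GPUs

total = steps_per_epoch(NOMINAL_SAMPLES, cfg.global_batch_size) * cfg.epochs
warmup = warmup_steps(total, cfg.warmup_ratio)
per_gpu = per_device_batch_size(cfg.global_batch_size, NUM_GPUS)

print(f"total steps ~ {total}, warmup steps ~ {warmup}, per-GPU batch = {per_gpu}")
```

Under these assumptions, one epoch comes to roughly 5,196 optimizer steps, of which about 156 are warmup, with 16 samples per GPU per step. The real numbers depend on the exact dataset size and on how the global batch of 128 is split between per-device batch size and gradient accumulation, neither of which the report states.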