Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
Authors: Fangxun Shu, Yue Liao, Lei Zhang, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Weilong Dai, ZhongTao, Zhelun Yu, Wanggui He, Siming Fu, Haoyuan Li, Si Liu, Hongsheng Li, Hao Jiang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that LLaVA-MoD surpasses existing works across various benchmarks while maintaining minimal activated parameters and low computational costs. Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8%, using merely 0.3% of the training data and 23% of the trainable parameters. The results underscore LLaVA-MoD's capability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of efficient MLLMs. The paper includes sections such as "4 EXPERIMENTS", "4.1 EXPERIMENTAL SETTINGS", "4.2 MAIN RESULTS", and "4.3 ABLATION STUDY", featuring tables with performance metrics (e.g., Table 1, Table 2). |
| Researcher Affiliation | Collaboration | 1Alibaba Group 2The Chinese University of Hong Kong 3University of California, San Diego 4Beihang University |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., equations 1-4) and diagrams (e.g., Figure 2, Figure 3), but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | The code is available at https://github.com/shufangxun/LLaVA-MoD. |
| Open Datasets | Yes | The training data consists of 5M samples from the open-source datasets... The detailed dataset of each training stage is illustrated in Appendix A.2. (e.g., LLaVA-1.5-Pretrain (Liu et al., 2023b), ShareGPT4V-PT (Chen et al., 2023a), GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), Visual Genome (Krishna et al., 2017), ScienceQA (Lu et al., 2022a), RLAIF-V (Yu et al., 2024b), etc.) |
| Dataset Splits | Yes | We conduct experiments on MME (Fu et al., 2023), MMB (Liu et al., 2023c), and MMBCN. Each encompasses various sub-tasks, enabling comprehensive evaluation of multimodal understanding and reasoning capabilities. Additionally, we carry out experiments across a broad spectrum of VQA tasks, which include general VQA, text-oriented VQA, and science VQA. Specifically, for general VQA tasks, we use VizWiz (Gurari et al., 2018) and GQA (Hudson & Manning, 2019)... In Table 1: VQAT: TextVQA val, MMB: MMBench dev, MMBCN: MMBench-Chinese dev. |
| Hardware Specification | Yes | Throughout all stages, we employ the Adam optimizer (Diederik, 2014) and train on 16 NVIDIA A100 GPUs for one epoch each, totaling approximately 960 GPU hours. For inference costs, our evaluation on a single A100-80G GPU indicates that LLaVA-MoD-2B is 2.5× faster in decoding speed, consumes 26% of the FLOPs, and uses 38% of the memory compared to Qwen-VL-Chat-7B. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Diederik, 2014)' and 'KTO (Ethayarajh et al., 2024) loss', but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The detailed training strategy and hyperparameters are illustrated in Appendix A.1. Table 10 lists the training hyperparameters of each stage (Initialization / Mimic Distillation / Preference Distillation): learning rate 1e-4 / 2e-5 / 2e-5; learning rate schedule: cosine decay; weight decay: 0.0; training epochs: 1; warm-up ratio: 0.03; global batch size: 256 / 128 / 128; numerical precision: Bfloat16; model parallelism: Zero2, Zero2 offload. |
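The report notes that the paper provides no pseudocode for its distillation objective. For orientation, the "mimic distillation" stage described above is, in general form, a KL-divergence match between teacher and student output distributions. The sketch below is a generic, dependency-free illustration of that standard objective, not the authors' implementation; the function names and the temperature value are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mimic_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Forward KL divergence KL(teacher || student) on softened
    distributions, scaled by T^2 -- the classic knowledge-distillation
    objective (Hinton et al., 2015), shown here only as a sketch of
    what a mimic-distillation stage optimizes."""
    p = softmax(teacher_logits, temperature)   # teacher distribution
    q = softmax(student_logits, temperature)   # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Identical logits give (near-)zero loss; a mismatch gives a positive loss.
print(mimic_distillation_loss([1.0, 2.0], [1.0, 2.0]))
print(mimic_distillation_loss([2.0, 1.0], [1.0, 2.0]))
```

In practice this loss is computed per token over the vocabulary and averaged across the batch; the paper's preference-distillation stage additionally uses a KTO loss, which is not sketched here.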