Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

MokA: Multimodal Low-Rank Adaptation for MLLMs

Authors: Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method.
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance; 3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; 4 Shanghai Artificial Intelligence Laboratory
Pseudocode | No | The paper describes the method and its components in detail but does not provide a formal pseudocode block or algorithm listing.
Open Source Code | Yes | The project page is at https://gewu-lab.github.io/MokA. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide source code in the supplementary material.
Open Datasets | Yes | Video-LLaVA [19] used a mixed dataset of images and videos for video captioning and image captioning tasks. ... AudioCaps [14] dataset is used for the audio captioning task. ... GigaSpeech-M [4] is a 1000h dataset for speech recognition task. ... MUSIC-AVQA [17] is an audio-visual-text dataset ... AVE [30] is an audio-visual-text dataset. ... LLaVA-Instruct-150K [20] is a set of GPT-generated multimodal instruction-following data. ... LibriSpeech [23] is a 960-hour dataset. ... MME_percep [9] is the perception subset of the MME benchmark ... MMBench [22] is a collection of benchmarks ... POPE [18] is a benchmark ... SEED-Bench [16] consists of 19K multiple-choice questions ... MMAU_mini-speech [27] is the speech subset of MMAU-mini benchmark. ... AIR-Bench_speech-en [37] is the English speech subset of the foundation part of AIR-Bench.
Dataset Splits | Yes | Our experiment of MLLM follows the widely used two-stage training paradigm: pre-training stage that aims to cross-modal alignment and supervised instruction-tuning for downstream tasks. ... For the audio-visual-text case, the model is fine-tuned on the train set of MUSIC-AVQA [17], and AVE [30], respectively. ... Inference: To well assess the effectiveness of our fine-tuning strategy, we evaluate our trained models on in-domain test sets or public benchmarks. Details are provided in the supplementary materials. Audio-visual-text: in-domain test set of MUSIC-AVQA and AVE dataset.
Hardware Specification | No | The paper discusses efficiency evaluation in terms of FLOPs, memory usage, and average forward time per sample (Table 8), but it does not specify the actual hardware (e.g., GPU models, CPU models) used for these measurements or for running the experiments. The text mentions 'GPU memory usage' but no specific GPU type.
Software Dependencies | No | The paper mentions several models and encoders like 'CLIP-ViT/L-14 [25] as the visual encoder', 'BEATs [5] encoder', 'OpenAI's Whisper model [26]', 'LLaMA-2-7b-Chat [31]', 'LLaMA-3-8B-Instruct [10]', and 'Qwen2-7B-Instruct [36]'. However, it does not specify version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key libraries.
Experiment Setup | Yes | During pre-training, using the AdamW optimizer with a cosine learning rate schedule. The initial learning rate is 1e-4 with a warmup ratio of 0.03. ... Trainable parameters include all projectors and our MokA module. ... Rank of low-rank matrices is 4. The remaining settings are the same as the first stage.
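The learning rate settings quoted in the Experiment Setup row (initial rate 1e-4, warmup ratio 0.03, cosine schedule) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at_step`, the linear warmup shape, the decay floor of zero, and the step count of 1000 are all assumptions made for illustration, since the quoted text does not state them.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.03, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    base_lr and warmup_ratio follow the values quoted from the paper.
    The linear warmup shape and min_lr=0.0 are assumptions.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total))
```

With these assumptions the rate peaks at 1e-4 once warmup ends and falls smoothly toward zero by the final step. The actual step count and decay floor would need to be taken from the released code.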