Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Authors: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Researcher Affiliation | Collaboration | Zhicheng Zhang (1,2), Wuyou Xia (1), Chenxi Zhao (1), Yan Zhou (3), Xiaoqiang Liu (3), Yongjie Zhu (3), Wenyu Qin (3), Pengfei Wan (3), Di Zhang (3), Jufeng Yang (1,2). (1) VCIP & TMCC & DISSec, College of Computer Science, Nankai University; (2) Pengcheng Laboratory; (3) Kuaishou Technology.
Pseudocode | No | The paper describes methods in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source code and demo are available at https://zzcheng.top/MODA.
Open Datasets | Yes | Perception: Following (Tong et al., 2024a), we conduct experiments on 4 types of perception tasks (i.e., general, knowledge, OCR, and vision-centric) across 16 benchmarks: MME (Fu et al., 2023), MMBench (Liu et al., 2025), SEED (Li et al., 2024), GQA (Hudson & Manning, 2019), ScienceQA (Lu et al., 2022), MMMU (Yue et al., 2024), MathVista (Lu et al., 2024), AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), OCRBench (Liu et al., 2024), TextVQA (Singh et al., 2019), DocVQA (Mathew et al., 2021), MMVP (Tong et al., 2024b), RealworldQA (xAI, 2024), and CV-Bench (Tong et al., 2024a). Cognition: Following (Dai et al., 2025), we conduct experiments on MMRole to evaluate role-playing performance from 8 aspects. Emotion: Following (Yang et al., 2023; Huang et al., 2024), we conduct experiments on 4 benchmark datasets. MVSA-S and MVSA-M (Niu et al., 2016) are datasets used for sentiment polarity classification [...] TumEmo (Yang et al., 2021) is a multimodal dataset [...] HFM (Liu et al., 2022) is a multimodal dataset.
Dataset Splits | No | The paper states, "For a fair comparison, all models are trained on 700K data samples for 1 epoch," and mentions using a batch size of 2048, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper mentions using specific visual encoders (CLIP (ViT-L/14)) and foundational large language models (Llama-3-Instruct-8B, Hermes2-Yi-34B) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for the experiments.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer and foundational models like Llama-3-Instruct-8B and Hermes2-Yi-34B, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | MODA is trained for 1 epoch with a batch size of 2048, using the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning rate schedule. The learning rate is set to 2e-5 for the LLM and 2e-6 for the visual encoder, respectively. The warmup rate is 0.03.
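The schedule reported above (cosine decay with a warmup rate of 0.03, base learning rates of 2e-5 for the LLM and 2e-6 for the visual encoder) can be sketched as a small function. This is a minimal sketch, not the authors' code; it assumes the warmup is linear and that "warmup rate" means the fraction of total training steps spent warming up, both common conventions:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_rate=0.03):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_rate))
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: fall from base_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Per the reported setup, one would evaluate this with `base_lr=2e-5` for the LLM parameter group and `base_lr=2e-6` for the visual-encoder group; the `total_steps` value is not given in the paper and would follow from the 700K samples and batch size 2048.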