ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Authors: Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
Researcher Affiliation | Collaboration | Mingrui Wu1, Xinyue Cai1, Jiayi Ji1, Jiale Li1, Oucheng Huang1, Gen Luo1, Hao Fei2, Guannan Jiang3, Xiaoshuai Sun1, Rongrong Ji1. 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; 2 National University of Singapore; 3 CATL
Pseudocode | No | The paper provides mathematical formulations and descriptions of the approach, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code: https://github.com/mrwu-mac/ControlMLLM
Open Datasets | Yes | We follow the setting of Ferret to form 1,748 questions (1,548 for test and 200 for validation) based on the LVIS [25] validation dataset, with corresponding box, mask, scribble, and point annotations.
Dataset Splits | Yes | We follow the setting of Ferret to form 1,748 questions (1,548 for test and 200 for validation) based on the LVIS [25] validation dataset, with corresponding box, mask, scribble, and point annotations.
Hardware Specification | Yes | All experiments are conducted on two RTX 3090 GPUs with 24 GB of memory each.
Software Dependencies | No | The paper mentions using 'LLaVA-v1.5-7B [35]' as the MLLM, but it does not specify software dependencies such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | Unless explicitly stated otherwise, the MLLM we use is LLaVA-v1.5-7B [35], with T = 5, α = 400, and β = 0.5.
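To give a concrete picture of the reported setup (LLaVA-v1.5-7B with T = 5 optimization steps, α = 400, and β = 0.5), the sketch below illustrates a generic training-free, test-time latent refinement loop driven by an attention-energy objective. It is an assumption-laden illustration, not the authors' implementation: `refine_latent`, `attn_fn`, the energy definition, and the blending rule are hypothetical stand-ins, and the released code in the repository linked above is the authoritative reference.

```python
import torch

def refine_latent(latent, attn_fn, region_mask, T=5, alpha=400.0, beta=0.5):
    """Sketch of a training-free, test-time latent refinement loop.

    Hypothetical interface (NOT the authors' API):
      attn_fn(latent) -> attention map over visual tokens, same shape as region_mask
      region_mask     -> 1 inside the referred region (box/mask/scribble/point), 0 elsewhere
    Defaults mirror the reported hyperparameters: T = 5, alpha = 400, beta = 0.5.
    """
    latent = latent.clone().requires_grad_(True)
    for _ in range(T):
        attn = attn_fn(latent)
        # Energy term: reward attention mass that falls inside the referred region.
        energy = -(attn * region_mask).sum() / (attn.sum() + 1e-8)
        (grad,) = torch.autograd.grad(energy, latent)
        with torch.no_grad():
            stepped = latent - alpha * grad                       # gradient step, alpha = 400
            latent.copy_(beta * stepped + (1.0 - beta) * latent)  # blend with previous latent, beta = 0.5
    return latent.detach()

# Toy usage with a stand-in attention function (illustration only).
def toy_attn(z):
    # Stand-in "attention": softmax over all positions of the latent itself.
    return torch.softmax(z.flatten(), dim=0).view_as(z)

latent0 = torch.randn(16, 16)
region = torch.zeros(16, 16)
region[4:8, 4:8] = 1.0
refined = refine_latent(latent0, toy_attn, region)
```

The point of the sketch is that nothing is trained: only a per-input latent is updated for a handful of steps at inference time, which is consistent with the modest hardware requirement reported above (two 24 GB RTX 3090 GPUs).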