Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning
Authors: Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, Mohan Kankanhalli
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both retrieval tasks (i.e., zero-shot composed image retrieval, visual storytelling image retrieval and visual dialog image retrieval) and text generation tasks (i.e., visual question answering) demonstrate the effectiveness of the proposed method. |
| Researcher Affiliation | Academia | Part of this work was done when Wei Li was an Intern at National University of Singapore. 1ReLER, CCAI, School of Computer Science and Technology, Zhejiang University, China. 2School of Computing, National University of Singapore, Singapore. |
| Pseudocode | No | The paper describes its method in prose and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at: https://github.com/dhg-wei/MCL. |
| Open Datasets | Yes | It costs approximately 60 A100 GPU days to generate 2.7 million tuples, using image-caption pairs from CC3M (Sharma et al., 2018) as source pairs. |
| Dataset Splits | Yes | We evaluate MCL on three zero-shot CIR benchmarks: CIRCO (Baldrati et al., 2023), CIRR (Liu et al., 2021a) and GeneCIS (Vaze et al., 2023). Figure 5 shows more qualitative results from the CIRCO validation set. |
| Hardware Specification | Yes | It costs approximately 60 A100 GPU days to generate 2.7 million tuples |
| Software Dependencies | No | The paper mentions models like "CLIP ViT-L/14", "OPT-2.7B", "OPT-6.7B", and "Llama2-7B" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | MCL is trained on MMC for 50,000 iterations with a batch size of 64. Both the LLM and CLIP model are frozen. The loss weights λCap and λRet in Equation 7 are set to 0.5 and 1.0 respectively. The temperature τ in Equation 3 and Equation 4 is set to 0.07. |
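A minimal sketch of how the quoted hyperparameters fit together, assuming the retrieval loss in Equations 3–4 is a standard InfoNCE-style contrastive loss (common for CLIP-based retrieval) and Equation 7 is a weighted sum of the captioning and retrieval losses. The function names, toy similarity values, and captioning-loss placeholder are illustrative assumptions, not taken from the paper's code.

```python
import math

# Quoted hyperparameters from the experiment setup row:
LAMBDA_CAP = 0.5   # weight on the captioning loss (lambda_Cap in Eq. 7)
LAMBDA_RET = 1.0   # weight on the retrieval loss (lambda_Ret in Eq. 7)
TAU = 0.07         # temperature in the contrastive loss (Eqs. 3 and 4)

def info_nce(sim_row, positive_idx, tau=TAU):
    """InfoNCE-style loss for one query, given its cosine similarities
    to a set of candidates; the positive candidate is at positive_idx.
    Computed with a log-sum-exp for numerical stability."""
    logits = [s / tau for s in sim_row]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[positive_idx] - log_denom)

def total_loss(caption_loss, retrieval_loss):
    """Weighted objective in the form of Eq. 7 (assumed structure)."""
    return LAMBDA_CAP * caption_loss + LAMBDA_RET * retrieval_loss

# Toy example: the positive candidate (index 0) is most similar,
# so the retrieval loss is small; 2.0 stands in for a captioning loss.
l_ret = info_nce([0.9, 0.2, 0.1], positive_idx=0)
l_total = total_loss(caption_loss=2.0, retrieval_loss=l_ret)
```

The small temperature (0.07) sharpens the softmax over candidates, so even modest similarity gaps produce near-one-hot distributions; this is why CLIP-style losses are sensitive to τ.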