Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model
Authors: Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, Wei Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CMU-MOSEI, CH-SIMS, CH-SIMSV2, BRCA, and MM-IMDB with multi-domain input verify the effectiveness of our model compared to current state-of-the-art methods, improving F1-Score by 0.4%, 0.9%, and 2.3% on the CMU-MOSEI, CH-SIMS, and CH-SIMSV2 datasets respectively, with 49% faster inference and 83.7% GPU memory savings. |
| Researcher Affiliation | Academia | Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, Wei Yang; Huazhong University of Science and Technology; {wenbingli, henrryzh, yjqing, skyesong, weiyangcs}@hust.edu.cn |
| Pseudocode | Yes | Algorithm 1: Coupled Mamba |
| Open Source Code | Yes | Code is available at https://github.com/hustcselwb/coupledmamba. |
| Open Datasets | Yes | We conduct experiments on five benchmark datasets (CMU-MOSEI, CH-SIMS [24], CH-SIMSV2 [25], MM-IMDB, and BRCA). |
| Dataset Splits | Yes | The CMU-MOSEI dataset, an extension of CMU-MOSI, contains 22856 samples of movie review video clips. In this dataset, 16326 samples are used as the training set, and the remaining 1871 and 4659 samples are used as the validation set and test set respectively. |
| Hardware Specification | Yes | All experiments were conducted on a Linux workstation equipped with a single NVIDIA 32GB V100 GPU and a 32-core Intel Xeon CPU. |
| Software Dependencies | Yes | The environment we use is Python 3.10, CUDA 12.1, torch 2.12. |
| Experiment Setup | Yes | Each Mamba block uses a hidden dimension size of 128, an expansion coefficient of 2, a convolution kernel size of 4, and = dstate/8; the model stacks 3 layers to train our Coupled Mamba. We use Adam to optimize the model, setting the learning rate to 0.0005 and the weight decay coefficient to 0.0005, training for 150 epochs with batch sizes of 1024, 128, and 256 on CMU-MOSEI, CH-SIMS, and CH-SIMSV2 respectively. L1 loss is used as the loss function for the regression task, and cross-entropy is used as the loss function for the classification task. |
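For concreteness, the reported training configuration can be collected into a single config object. This is a minimal sketch for reproduction attempts, not code from the authors' repository; the class and field names (`CoupledMambaConfig`, `hidden_dim`, `expand`, etc.) are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class CoupledMambaConfig:
    """Hyperparameters as reported in the paper's experiment setup."""
    # Per-block Mamba settings
    hidden_dim: int = 128     # hidden dimension size
    expand: int = 2           # expansion coefficient
    conv_kernel: int = 4      # convolution kernel size
    num_layers: int = 3       # number of stacked layers
    # Adam optimization settings
    lr: float = 5e-4
    weight_decay: float = 5e-4
    epochs: int = 150
    # Dataset-specific batch sizes
    batch_size: dict = field(default_factory=lambda: {
        "CMU-MOSEI": 1024,
        "CH-SIMS": 128,
        "CH-SIMSV2": 256,
    })

cfg = CoupledMambaConfig()
print(cfg.hidden_dim, cfg.num_layers, cfg.batch_size["CH-SIMS"])  # → 128 3 128
```

A `default_factory` is used for the batch-size dictionary because mutable defaults are not allowed directly on dataclass fields; the per-dataset lookup mirrors the three batch sizes quoted above.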