Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
CMoB: Modality Valuation via Causal Effect for Balanced Multimodal Learning
Authors: Jun Wang, Fuyuan CAO, Zhixin Xue, Xingwang Zhao, Jiye Liang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark multimodal datasets and multimodal frameworks demonstrate the superiority of our CMoB approach for balanced multimodal learning. |
| Researcher Affiliation | Academia | Jun Wang1, Fuyuan Cao1,2, Zhixin Xue1, Xingwang Zhao1, Jiye Liang1. 1School of Computer and Information Technology, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China. 2Shanxi Taihang Laboratory, Taiyuan, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: CMoB Algorithm |
| Open Source Code | Yes | We make our code publicly available at https://github.com/perpetual1859/CMoB, including data and detail documentation. |
| Open Datasets | Yes | Datasets: We select five public datasets, including CREMA-D [43], Kinetic Sounds [44], UCF-101 [45], CMU-MOSEI [46], and NVGesture [47] datasets to validate our proposed method. The description of the complete dataset is provided in Appendix A.1. |
| Dataset Splits | Yes | UCF-101 is an action recognition dataset with two modalities, RGB and optical flow. This dataset contains 101 categories of human actions with 9,537 samples in the training set and 3,783 samples in the test set. |
| Hardware Specification | Yes | All models are trained on 2 NVIDIA DGX A100. |
| Software Dependencies | No | In our experiments, we use the raw data for experiments. Following [17, 12, 51], the architecture and initialization setup followed an unbalanced multimodal learning study for a fair comparison. For the CREMA-D and the Kinetic Sounds dataset, ResNet-18 is employed as the backbone for processing both audio and video data and trained from scratch. For the CMU-MOSEI dataset, we employ transformer-based networks as the backbone architecture, training the model from scratch. Encoders used for UCF-101 are ImageNet pre-trained. In terms of video and optical flow modalities, we first select 10 frames from each clip and then uniformly sample three frames as input. We adjusted the input channels of ResNet-18 from three to one to fit our data format. For audio modal data, we convert to a 257 × 299 spectrogram for CREMA-D and a 257 × 1004 spectrogram for Kinetics Sounds. For text-image datasets, our framework employs ResNet-50 as the image encoder and BERT for text processing, where images are resized to 224 × 224 resolution and text sequences are truncated to a maximum length of 128 characters. During training, we use the SGD optimizer with momentum (0.9) and set the learning rate at 1 × 10⁻³. |
| Experiment Setup | Yes | During training, we use the SGD optimizer with momentum (0.9) and set the learning rate at 1 × 10⁻³. |
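
The quoted setup (SGD with momentum 0.9, a learning rate of 1e-3, and three frames uniformly sampled from 10 selected frames) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' code. The names `TrainConfig`, `uniform_frame_indices`, and `sgd_momentum_step` are hypothetical, the evenly spaced index scheme is an assumption (the quoted text does not specify one), and the update rule assumes the common formulation `v = momentum * v + grad; p = p - lr * v`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the table; the field names are hypothetical."""
    learning_rate: float = 1e-3   # "learning rate at 1 x 10^-3"
    momentum: float = 0.9         # "SGD optimizer with momentum (0.9)"
    frames_selected: int = 10     # "select 10 frames from each clip"
    frames_sampled: int = 3       # "uniformly sample three frames"


def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced indices into range(num_frames), endpoints included.

    Assumption: the exact sampling scheme is not given in the quoted text.
    """
    if not 1 <= num_samples <= num_frames:
        raise ValueError("num_samples must be between 1 and num_frames")
    if num_samples == 1:
        return [num_frames // 2]
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * (num_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


def sgd_momentum_step(params, grads, velocity, cfg: TrainConfig):
    """One SGD-with-momentum update (assumed form: v = mu*v + g; p = p - lr*v)."""
    new_velocity = [cfg.momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - cfg.learning_rate * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


cfg = TrainConfig()
frame_ids = uniform_frame_indices(cfg.frames_selected, cfg.frames_sampled)
params, velocity = sgd_momentum_step([1.0], [2.0], [0.0], cfg)
```

Under these assumptions, `frame_ids` holds the first, a middle, and the last of the 10 selected frames, and `params` moves against the gradient by the learning rate times the accumulated velocity. For the actual training procedure, the repository linked in the Open Source Code row is the authoritative source.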