Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion
Authors: Qing-Yuan Jiang, Longfei Huang, Yang Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology; State Key Lab. for Novel Software Technology, Nanjing University, P.R. China |
| Pseudocode | Yes | Algorithm 1 Learning algorithm of our proposed method. |
| Open Source Code | Yes | The source code is available at https://github.com/njustkmg/NeurIPS25-AUG. |
| Open Datasets | Yes | Dataset: We carry out the experiments on six extensive multimodal datasets, i.e., CREMAD [4], KSounds [2], NVGesture [31], VGGSound [6], Twitter [51], and Sarcasm [3] datasets. |
| Dataset Splits | Yes | Specifically, the CREMAD dataset contains 7,442 clips, which are divided into a training set with 6,698 samples and a testing set with 744 samples. The KSounds dataset, which contains 19,000 video clips, is divided into a training set with 15,000 clips, a validation set with 1,900 clips, and a testing set with 1,900 clips. The VGGSound dataset includes 168,618 videos for training and validation, and 13,954 videos for testing. The NVGesture dataset is divided into 1,050 samples for training and 482 samples for testing. The Twitter dataset is divided into a training set with 3,197 pairs, a validation set with 1,122 pairs, and a testing set with 1,037 pairs. The Sarcasm dataset includes 19,816 pairs for the training set, 2,410 pairs for the validation set, and 2,409 pairs for the testing set. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 4090 and all models are implemented with pytorch. |
| Software Dependencies | No | All experiments are conducted on an NVIDIA GeForce RTX 4090 and all models are implemented with pytorch. |
| Experiment Setup | Yes | Following OGM [35], we employ ResNet18 [19] as the backbone to encode audio and video for CREMAD, KSounds and VGGSound datasets. All the parameters of the backbone are randomly initialized. For NVGesture dataset, we employ the I3D [5] as unimodal branch following the setting of [45]. We initialize the encoder with the pre-trained model trained on ImageNet. For the architecture of the configurable classifier, we explore a two-layer network, which can be denoted as $\text{Layer1}(D \times 256) \rightarrow \text{ReLU} \rightarrow \text{Layer2}(256 \times K)$. Here, $D$ denotes the output dimensions of encoders, Layer1 / Layer2 are fully connected layers, and ReLU denotes the ReLU [1] activation layer. Furthermore, the Layer2 is utilized as shared head for all modalities as described in Section 3. Both Layer1 and Layer2 are randomly initialized. In addition, all hyper-parameters are selected by using the cross-validation strategy. Specifically, we use stochastic gradient descent (SGD) as the optimizer with a momentum of 0.9 and weight decay of $1 \times 10^{-4}$. The initial learning rate is set to be $1 \times 10^{-2}$ for CREMAD, KSounds, VGGSound, and NVGesture datasets. During training, the learning rate is progressively reduced by a factor of ten upon observing that the loss saturates. The batch size is set to be 64 for CREMAD and KSounds datasets, 16 for VGGSound dataset, and 2 for NVGesture dataset. We set the iteration $t_N$ for checking whether to assign the classifier to 20 epochs for CREMAD, 5 for Twitter, 1 for Sarcasm, and 10 for VGGSound, KSounds, NVGesture datasets. For all datasets, we search $\lambda$ in {0.1, 0.2, 0.33, 0.5, 1.0}. For all datasets, $\sigma$ and $\tau$ are set to be 1.0 and 0.01, respectively. For Twitter and Sarcasm dataset, following [51, 3], we adopt BERT [9] as the text encoder and ResNet50 [19] as the image encoder. We use Adam [26] as the optimizer, with an initial learning rate of $2 \times 10^{-5}$. The batch size is set to 32 for Twitter and Sarcasm datasets. The other parameter settings are the same as audio-video datasets. |
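
The hyper-parameters quoted in the Experiment Setup row can be collected into a small lookup table. The following is a minimal sketch in plain Python, not the authors' code: the names `TrainConfig`, `CONFIGS`, `sgd`, `adam`, and `lr_after_reductions` are hypothetical and introduced only for illustration. The row does not say how loss saturation is detected, so the helper only computes the learning rate after a given number of tenfold reductions. Momentum and weight decay are left unset for the Adam-based datasets because the row does not state them explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the per-dataset settings quoted in the table."""
    optimizer: str
    lr: float                     # initial learning rate
    batch_size: int
    check_every_epochs: int       # t_N: interval for the classifier-assignment check
    momentum: Optional[float] = None
    weight_decay: Optional[float] = None


def sgd(batch_size: int, check_every_epochs: int) -> TrainConfig:
    # SGD settings reported for the audio-video and gesture datasets.
    return TrainConfig("sgd", 1e-2, batch_size, check_every_epochs,
                       momentum=0.9, weight_decay=1e-4)


def adam(check_every_epochs: int) -> TrainConfig:
    # Adam settings reported for Twitter and Sarcasm (batch size 32).
    # Momentum and weight decay are not stated for Adam in the row; left unset.
    return TrainConfig("adam", 2e-5, 32, check_every_epochs)


CONFIGS = {
    "CREMAD": sgd(batch_size=64, check_every_epochs=20),
    "KSounds": sgd(batch_size=64, check_every_epochs=10),
    "VGGSound": sgd(batch_size=16, check_every_epochs=10),
    "NVGesture": sgd(batch_size=2, check_every_epochs=10),
    "Twitter": adam(check_every_epochs=5),
    "Sarcasm": adam(check_every_epochs=1),
}

# Values shared by all datasets according to the row.
LAMBDA_GRID = (0.1, 0.2, 0.33, 0.5, 1.0)  # search space for lambda
SIGMA = 1.0
TAU = 0.01


def lr_after_reductions(cfg: TrainConfig, n_reductions: int) -> float:
    """Learning rate after n tenfold reductions (the saturation test is not modeled)."""
    return cfg.lr / (10 ** n_reductions)
```

For example, `CONFIGS["NVGesture"].batch_size` returns the batch size reported for NVGesture, and `lr_after_reductions(CONFIGS["CREMAD"], 1)` gives the CREMAD learning rate after one reduction.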