Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning
Authors: Seong-Hyeon Hwang, Soyoung Choi, Steven Euijong Whang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple multimodal classification benchmarks demonstrate that MIDAS significantly outperforms related baselines in addressing modality imbalance. We conduct comprehensive evaluations of MIDAS on multiple real-world multimodal datasets for classification. |
| Researcher Affiliation | Academia | Seong-Hyeon Hwang Soyoung Choi Steven Euijong Whang KAIST EMAIL |
| Pseudocode | Yes | A.1 The Algorithm of MIDAS Algorithm 1: The algorithm of MIDAS |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We use open access datasets for experiments and provide the codes for conducting experiments in the supplemental material. |
| Open Datasets | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We use open access datasets for experiments and provide the codes for conducting experiments in the supplemental material. Datasets: We evaluate our method and baselines on four widely used benchmarks for imbalanced multimodal learning, each exhibiting varying degrees and types of modality characteristics: Kinetics-Sounds [15] is a dataset linking audio and video clips for action recognition with 31 classes. CREMA-D [31] is an audiovisual dataset for emotion recognition featuring actors speaking sentences with 6 classes. UCF-101 [32] is an action recognition dataset consisting of RGB frames and optical flows with 101 classes. Food-101 [33] is a dataset of food images paired with their corresponding textual recipes with 101 classes. Additional dataset statistics are summarized in the Appendix (Sec. A.3). |
| Dataset Splits | Yes | Table 5: Summary of datasets used in our experiments, listed as #Train / #Val / #Test, #Class, Modality 1 + Modality 2. Kinetics-Sounds: 16,890 / 2,461 / 4,778, 31 classes, Audio + Video; CREMA-D: 5,209 / 744 / 1,489, 6 classes, Audio + Video; UCF-101: 9,159 / 1,308 / 2,618, 101 classes, Optical Flow + RGB frame; Food-101: 63,481 / 9,069 / 18,138, 101 classes, Text + Image. |
| Hardware Specification | Yes | All experiments are conducted using NVIDIA GeForce RTX A6000 and Quadro RTX 8000 GPUs. |
| Software Dependencies | No | The paper mentions encoders (ResNet-18, ELECTRA) and an optimizer (SGD) but does not provide specific version numbers for the software stack (e.g., Python, PyTorch, CUDA) required for replication. |
| Experiment Setup | Yes | Implementation Details: We conduct experiments following the configurations in [6]. For Kinetics-Sounds and CREMA-D, we use ResNet-18 [34] encoders for both audio and video, training from scratch. For UCF-101, we also use ResNet-18 as encoders. For the Food-101 dataset, we use a pre-trained ResNet-18 and a pre-trained ELECTRA [35] as image and text encoders, respectively. More detailed configurations are provided in the Appendix (Sec. A.3). Implementation details: Continuing from Sec. 4.1, we provide additional configurations for implementing experiments. We use the SGD optimizer with momentum of 0.9 and an initial learning rate of 1e-3 for Kinetics-Sounds, CREMA-D, and Food-101, and 1e-2 for UCF-101. We use a batch size of 64 across all datasets. Models are trained for 30 epochs for Food-101 and 70 epochs for the other three datasets. We combine the features from different modalities by concatenation. The step size of the StepLR scheduler is 15 for Food-101 and 50 for the others. We apply a weight decay of 1e-4 and a StepLR learning rate schedule for all datasets. We use five workers for all experiments. For MIDAS, we use the hyperparameter λ of 5 for the Kinetics-Sounds dataset, and 1 for others. We also provide an analysis of the hyperparameter λ in Sec. A.4. We use η of 5e-2 for all datasets. The best model is selected based on validation accuracy. |
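
The hyperparameters quoted in the Experiment Setup row can be collected into a small per-dataset configuration. The following is a minimal sketch, not the authors' code: the names `TrainConfig`, `CONFIGS`, and `lr_at_epoch` are hypothetical, and the decay factor `gamma = 0.1` is an assumption, since the quoted text gives only the scheduler step size and not the decay factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are illustrative)."""
    lr: float             # initial learning rate for SGD
    epochs: int           # total number of training epochs
    step_size: int        # StepLR step size, in epochs
    midas_lambda: float   # MIDAS hyperparameter lambda
    momentum: float = 0.9
    weight_decay: float = 1e-4
    batch_size: int = 64
    num_workers: int = 5
    eta: float = 5e-2     # MIDAS hyperparameter eta
    gamma: float = 0.1    # ASSUMPTION: decay factor is not stated in the quoted text


# Per-dataset values taken from the quoted implementation details.
CONFIGS = {
    "Kinetics-Sounds": TrainConfig(lr=1e-3, epochs=70, step_size=50, midas_lambda=5),
    "CREMA-D": TrainConfig(lr=1e-3, epochs=70, step_size=50, midas_lambda=1),
    "UCF-101": TrainConfig(lr=1e-2, epochs=70, step_size=50, midas_lambda=1),
    "Food-101": TrainConfig(lr=1e-3, epochs=30, step_size=15, midas_lambda=1),
}


def lr_at_epoch(cfg: TrainConfig, epoch: int) -> float:
    """Step decay: scale the initial rate by gamma once per completed step_size epochs."""
    return cfg.lr * cfg.gamma ** (epoch // cfg.step_size)


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        final_lr = lr_at_epoch(cfg, cfg.epochs - 1)
        print(f"{name}: initial lr={cfg.lr:g}, lr in final epoch={final_lr:g}")
```

Under the assumed decay factor, each schedule drops the learning rate exactly once: at epoch 15 for Food-101 and at epoch 50 for the other three datasets.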