Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Balancing Multimodal Training Through Game-Theoretic Regularization

Authors: Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul Liang, Matthew B Blaschko, Maarten De Vos

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate MCR on synthetic datasets and several established real-world multimodal benchmarks, including action recognition on AVE [51] and UCF [47], emotion recognition on CREMA-D [4], human sentiment on CMU-MOSI [65], human emotions on CMU-MOSEI [67], and egocentric action recognition on Something-Something [14]. Our results demonstrate that MCR outperforms all previous methods and simple baseline, clearly demonstrating that training modalities jointly leads to important performance gains on both synthetic and large real-world datasets.
Researcher Affiliation | Academia | 1 Department of Electrical Engineering, KU Leuven, Leuven, Belgium; 2 Department of Development and Regeneration, KU Leuven, Leuven, Belgium; 3 MIT Media Lab and EECS, Cambridge, MA, USA
Pseudocode | Yes | Algorithm 1 Multimodal Training with MCR
Open Source Code | Yes | MCR outperforms all previously suggested training strategies and simple baseline, clearly demonstrating that training modalities jointly leads to important performance gains on both synthetic and large real-world datasets. We release our code and models at https://github.com/kkontras/MCR.
Open Datasets | Yes | We extensively evaluate MCR on synthetic datasets and several established real-world multimodal benchmarks, including action recognition on AVE [51] and UCF [47], emotion recognition on CREMA-D [4], human sentiment on CMU-MOSI [65], human emotions on CMU-MOSEI [67], and egocentric action recognition on Something-Something [14].
Dataset Splits | Yes | AVE provides predefined training, validation, and test splits. Our std is derived from three random seeds on the same test set. ... For model evaluation, we utilize the 3-fold split offered [47], reporting the std across these folds. ... Results are reported based on three random seeds on the same validation set. ... We report standard deviation (std) across folds for consistency. ... Our dataset division follows Goncalves et al. [12], excluding actor overlap between training, validation, and test sets.
Hardware Specification | Yes | Lastly, all of our experiments run on single GPU with different nodes being used for different experiments. For the largest ones we utilized H100 with 80Gb vram to run the experiments of Sth-Sth which required the longest of all up to 48 hours per run. For the rest, we would have from some minutes on the smallest experiment up to 4-5 hours for the datasets CREMA-D, AVE and UCF depending on the GPU and the available RAM.
Software Dependencies | No | The paper mentions optimizers such as Adam [26] and AdamW [36], and model architectures such as ResNet-18 [17], Transformer [53], Wav2Vec2 [3], HuBERT [18], ViViT [2], Conformer [15], and Swin Transformer [35]. However, it does not specify versions for underlying software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow), programming languages (e.g., Python), or CUDA.
Experiment Setup | Yes | All models are optimized using Adam [26] with a cosine learning rate scheduler and a steady warm-up phase, except for the Something-Something dataset, where we use AdamW [36]. Early stopping is applied for all models, with maximum epochs set to 100 for ResNet and Transformer models, 50 for Conformer models, and 30 epochs in total for Swin Transformers without early stopping. Batch sizes are adjusted based on computational resources, with ResNets and Transformers both using a batch size of 32, Conformers using 8, and Swin-TF using 16. ... We use different learning rates (lr) and weight decay (wd) values across experiments, tailored to each dataset and model. For ResNet models, we use lr = 1e-3 and wd = 1e-4 for CREMA-D, while both AVE and UCF use lr = 1e-4 and wd = 1e-4. For Transformer models, including MOSI and MOSEI on two and three modalities, the hyperparameters are consistent lr = 1e-4 and wd = 1e-4. Similarly, for Conformer models, we set lr = 5e-5 and wd = 5e-6 for CREMA-D, while AVE uses lr = 1e-4 and wd = 1e-4. Finally, for Swin-TF models trained on the Something-Something dataset, we configure lr = 1e-4 and wd = 0.02.
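
For readers who want to reuse the reported settings, the Experiment Setup entry can be collected into a small lookup table, together with a cosine learning-rate schedule with warm-up. This is only a sketch, not the authors' released code: the dictionary keys, the function name lr_at, the linear shape of the warm-up, and the decay to zero are assumptions made for illustration, and the paper excerpt does not state the warm-up length.

```python
import math

# (model family, dataset) -> learning rate and weight decay as quoted in the
# Experiment Setup entry. Key names are illustrative, not from the released code.
HPARAMS = {
    ("resnet", "CREMA-D"): {"lr": 1e-3, "wd": 1e-4},
    ("resnet", "AVE"): {"lr": 1e-4, "wd": 1e-4},
    ("resnet", "UCF"): {"lr": 1e-4, "wd": 1e-4},
    ("transformer", "MOSI"): {"lr": 1e-4, "wd": 1e-4},
    ("transformer", "MOSEI"): {"lr": 1e-4, "wd": 1e-4},
    ("conformer", "CREMA-D"): {"lr": 5e-5, "wd": 5e-6},
    ("conformer", "AVE"): {"lr": 1e-4, "wd": 1e-4},
    ("swin", "Something-Something"): {"lr": 1e-4, "wd": 0.02},
}

# Batch sizes and maximum epochs per model family, as quoted.
BATCH_SIZE = {"resnet": 32, "transformer": 32, "conformer": 8, "swin": 16}
MAX_EPOCHS = {"resnet": 100, "transformer": 100, "conformer": 50, "swin": 30}


def lr_at(step, total_steps, base_lr, warmup_steps):
    """Learning rate at a given step: warm-up followed by cosine decay.

    Assumptions (not stated in the source): the warm-up is linear and the
    cosine phase decays all the way to zero at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the cosine phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = HPARAMS[("resnet", "CREMA-D")]
    # warmup_steps=10 and total_steps=100 are placeholder values.
    for step in (0, 9, 10, 55, 100):
        print(step, lr_at(step, total_steps=100, base_lr=cfg["lr"], warmup_steps=10))
```

Keeping the settings in one table keyed by model family and dataset makes it easy to check a run against the reported values; the warm-up length and total step count still have to be taken from the released repository or chosen by the reader.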