Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Adaptive Re-calibration Learning for Balanced Multimodal Intention Recognition

Authors: Qu Yang, Xiyang Li, Fu Lin, Mang Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on multiple MIR benchmarks demonstrate that ARL significantly outperforms existing methods in both accuracy and robustness, particularly under noisy or modality-degraded conditions. Extensive experiments on multiple MIR benchmarks demonstrate that ARL consistently outperforms state-of-the-art methods in both accuracy and robustness, especially under noisy or modality-degraded conditions. We conduct extensive experiments on multiple MIR benchmarks, demonstrating that ARL consistently outperforms state-of-the-art methods in both accuracy and robustness, especially under noisy or modality-degraded conditions.
Researcher Affiliation	Academia	1 School of Computer Science, Wuhan University, Wuhan, China 2 Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China EMAIL
Pseudocode	Yes	To provide a clear, step-by-step overview of the entire training process, we include a detailed pseudocode of the ARL training procedure in Appendix A.4
Open Source Code	Yes	https://github.com/yan9qu/NeurIPS25-ARL
Open Datasets	Yes	We evaluate our ARL framework on two multimodal benchmarks: MIntRec [21] for intent recognition, and MOSI [43] for sentiment analysis.
Dataset Splits	Yes	The MIntRec dataset comprises 2,224 samples, split into 1,334 for training, 445 for validation, and 445 for testing. It supports two levels of intent classification: coarse-grained, with binary labels distinguishing between expressing emotions and achieving goals, and fine-grained, with twenty labels (11 for expressing emotions and 9 for achieving goals). The MOSI dataset consists of 2,199 samples, divided into 1,284 for training, 229 for validation, and 686 for testing, with sentiment scores ranging from -3 (highly negative) to 3 (highly positive).
Hardware Specification	Yes	All experiments are conducted on 4 NVIDIA 4090 GPUs.
Software Dependencies	No	The paper mentions 'We adopt hyper-parameters such as learning rate, batch size, and optimizer settings from the publicly released configurations of these baseline methods. We tune specific hyper-parameters for ARL, including masking threshold in CISC, weight adjustment factor in WEC, optimizing them via grid search.' However, it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup	Yes	We adopt hyper-parameters such as learning rate, batch size, and optimizer settings from the publicly released configurations of these baseline methods. We tune specific hyper-parameters for ARL, including masking threshold in CISC, weight adjustment factor in WEC, optimizing them via grid search. For MIntRec, we use pre-extracted features with dimensions 768 for text, 256 for visual, and 768 for acoustic. For MOSI, feature dimensions are 768 for text, 20 for video, and 5 for audio. Applied every T epochs (see Fig. 2), WEC prevents long-term bias, enhancing generalization across training.
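
Note on the Dataset Splits row: the quoted split sizes can be checked for internal consistency with a few lines of arithmetic. The sketch below is illustrative only and is not part of the paper or its released code; the dictionary layout and the function name check_splits are assumptions made for this example, while the numbers are copied from the row above.

```python
# Split sizes as quoted in the "Dataset Splits" row above.
# The dictionary layout is a hypothetical convenience, not the authors' format.
SPLITS = {
    "MIntRec": {"total": 2224, "train": 1334, "val": 445, "test": 445},
    "MOSI": {"total": 2199, "train": 1284, "val": 229, "test": 686},
}

# Fine-grained MIntRec label counts as quoted: 11 emotion + 9 goal labels.
FINE_GRAINED = {"total": 20, "expressing_emotions": 11, "achieving_goals": 9}


def check_splits(splits):
    """Return per-dataset split fractions; raise if parts do not sum to total."""
    report = {}
    for name, sizes in splits.items():
        parts = {k: v for k, v in sizes.items() if k != "total"}
        if sum(parts.values()) != sizes["total"]:
            raise ValueError(f"{name}: splits sum to {sum(parts.values())}, "
                             f"expected {sizes['total']}")
        # Fraction of the full dataset held by each split, rounded for display.
        report[name] = {k: round(v / sizes["total"], 3) for k, v in parts.items()}
    return report


report = check_splits(SPLITS)
for name, fractions in report.items():
    print(name, fractions)
```

Both datasets pass the check: the reported train, validation, and test sizes sum to the stated totals, and the 11 + 9 fine-grained MIntRec labels sum to the stated twenty.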