Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

Authors: Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks.
Researcher Affiliation | Collaboration | 1 School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics; 2 Engineering Research Center of Intelligent Finance, Ministry of Education, Southwestern University of Finance and Economics; 3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 4 X-Humanoid; 5 University of Electronic Science and Technology of China; 6 Shanghai Academy of AI for Science; 7 Artificial Intelligence Innovation and Incubation Institute, Fudan University; 8 Artificial Intelligence and Digital Finance Key Laboratory of Sichuan Province
Pseudocode | Yes | Algorithm 1 outlines the training procedure of InfMasking, formulated in the general case with n modalities (e.g., image, text, audio, etc.).
Open Source Code | Yes | Code is released at https://github.com/brightest66/InfMasking.
Open Datasets | Yes | We perform experiments on both synthetic benchmarks and multiple large-scale real-world datasets to verify the effectiveness of InfMasking in learning representations from diverse modalities. To evaluate InfMasking's capacity to capture three essential aspects of multimodal interactions (i.e., uniqueness, redundancy, and synergy), we generate synthetic data in a controlled environment based on the Trifeature dataset [22]. Furthermore, we assess the generalizability of InfMasking on several widely used multimodal benchmark datasets involving diverse modality combinations in real-world scenarios. These tasks span various domains (e.g., healthcare, robotics, etc.), allowing for a thorough assessment of the model's representation capabilities across diverse modalities. Detailed experimental settings are provided in Appendix A.
Dataset Splits | Yes | The Trifeature dataset [22] is designed to investigate the properties of visual neural networks and comprises three distinct features: shape, color, and texture. Each feature consists of 10 categories, resulting in 1,000 unique combinations. Of these, 800 are used for training and 200 for testing. Each training combination is instantiated three times with random rotations applied to both shape and texture components. Shapes are rendered within a 128 × 128 bounding box, with rotation angles uniformly sampled from [−45°, 45°], and then randomly placed within a 224 × 224 image canvas while ensuring full visibility. Texture and color are independently applied in the same manner. Image pairs are constructed from these instances, resulting in 10,000 training pairs and 4,096 test pairs, both sampled from the same underlying distribution.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA 4090 GPU with 24GB memory.
Software Dependencies | No | We use AdamW [33] as the optimizer in all experiments. Detailed hyperparameters are listed in Tab. 6. Following [12] on MM-IMDb, we also use a cosine scheduler with final value 10^{-6} and a warmup over 10 epochs. And all models are trained for 100 epochs except for MM-IMDb which is trained for 70 epochs. All experiments are conducted on a single NVIDIA 4090 GPU with 24GB memory.
Experiment Setup | Yes | Training protocol. All experiments are conducted using five independent runs with random seeds in the range [42, 46]. We report the mean and standard deviation of performance metrics (i.e., accuracy, mean squared error) to account for variability across runs. Early stopping based on validation accuracy is systematically applied to prevent overfitting. The best-performing checkpoint on the validation set is selected for final evaluation on the test set. For dataset-specific encoder architectures, modality-specific data augmentation and latent converters, we follow the same configurations as CoMM [12]. Training details. We use AdamW [33] as the optimizer in all experiments. Detailed hyperparameters are listed in Tab. 6. Following [12] on MM-IMDb, we also use a cosine scheduler with final value 10^{-6} and a warmup over 10 epochs. And all models are trained for 100 epochs except for MM-IMDb which is trained for 70 epochs. All experiments are conducted on a single NVIDIA 4090 GPU with 24GB memory.
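
The Experiment Setup excerpt describes a learning-rate schedule (warmup over 10 epochs, then cosine decay to a final value of 10^-6) and a reporting protocol (five runs with seeds 42 to 46, mean and standard deviation). The following is a minimal illustrative sketch of both, not the authors' implementation. Several details are assumptions: the warmup is taken to be linear, the standard deviation is taken to be the sample standard deviation, the names `lr_at_epoch` and `summarize_runs` are hypothetical, and `base_lr` is a placeholder because the values in Tab. 6 are not reproduced on this page.

```python
import math
import statistics


def lr_at_epoch(epoch, base_lr, total_epochs=100, warmup_epochs=10, final_lr=1e-6):
    """Learning rate for a 0-indexed epoch: warmup, then cosine decay.

    Assumptions (not confirmed by the excerpt): the warmup is linear and
    base_lr is a placeholder for the per-dataset value listed in Tab. 6.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    # Cosine interpolation from base_lr down to final_lr.
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Five independent runs with seeds 42 through 46, as stated in the excerpt.
SEEDS = list(range(42, 47))


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed metric values.

    Using the sample (n - 1) standard deviation is an assumption; the excerpt
    does not say which estimator is reported.
    """
    return statistics.mean(scores), statistics.stdev(scores)
```

For MM-IMDb, which the excerpt says is trained for 70 epochs, the same function would be called with `total_epochs=70`. In practice one would call `summarize_runs` on the test metric of the best validation checkpoint from each seed, matching the early-stopping protocol described above.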