Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning
Authors: Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Prof Javen Qinfeng Shi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning. 1 Introduction Modern multimodal learning has achieved remarkable success in jointly modeling information from heterogeneous sources such as vision, language, and audio. 5 Experiments We conduct extensive experiments to validate our theoretical results, including numerical simulations (§5.1), a real-world image-text dataset with independent semantic variables (§5.2), a synthetic dataset with dependent semantic variables (§5.3), and a case study with Open CLIP models (§5.4). |
| Researcher Affiliation | Academia | Yichao Cai Yuhang Liu Erdun Gao Tianjiao Jiang Zhen Zhang Anton van den Hengel Javen Qinfeng Shi Australian Institute for Machine Learning The University of Adelaide, SA 5000, Australia Equal contribution. Correspondence to: EMAIL. |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is available at https://anonymous.4open.science/r/crossmodal_mislaignment-4A3B. |
| Open Datasets | Yes | We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets 5.2 MPI3D-Complex: Real-World Dataset with Factorized Latent Variables 5.3 Causal3DIdent: Semi-Synthetic Dataset with Structured Causal Latent Variables 5.4 Case Study: Zero-Shot Evaluation of Open CLIP Model Representations We further validate our theoretical findings through a comprehensive zero-shot evaluation case study on Open CLIP models [24], a foundation MMCL model pretrained on the LAION-400M dataset [61]. |
| Dataset Splits | Yes | D.1 Detailed Experimental Setup Downstream tasks design. To evaluate pretrained representations under various bias conditions, we construct several downstream tasks. Specifically, four regression tasks are created by generating labels using complex nonlinear functions fyi, each applied to different subsets of the true semantic variables: y1 = fy1(s[3]), y2 = fy2(s[5]), y3 = fy3(s[7]), y4 = fy4(s[9]). For both downstream tasks, we fix the pretrained encoders and evaluate the quality of the learned representations using a two-layer MLP as a probing model. We generate 20,480 samples as the evaluation set for training the regressors and classifiers, along with an additional 20,480 samples as the in-distribution test set. To assess OOD generalization, we generate another 20,480 samples from the shifted latent space as the OOD test set. E.1 Detailed Experimental Setup Training details. For each setting, the dataset is partitioned into training, evaluation, and test subsets in a fixed ratio of 44,720 : 23,040 : 23,040. F.1 Detailed Experimental Setup Image latent factors and image generation. Following prior work [83, 68, 13, 78], we utilize the Causal3DIdent dataset to synthesize images from a predefined latent causal structure. Images are generated using the Blender renderer [9], which applies a complex rendering function parameterized by 11 input variables. In our configuration, the object's z-position is fixed, leaving 10 latent factors that govern image generation. We synthesize 80,000 samples for MMCL training, 10,000 samples for classifier or regressor training, and another 10,000 samples for test-time evaluation. |
| Hardware Specification | Yes | H Computation Resources All experiments were conducted on a high-performance computing cluster equipped with 4 NVIDIA A100 GPUs (40 GB each), running CUDA 12.2 and driver version 535.161.07. The system also included an AMD EPYC 7313 16-core processor and 503 GB of RAM. For the numerical simulations, we trained over 120 models in total, requiring approximately 70 GPU-hours across 4 GPUs. On the MPI3D-Complex dataset, we trained 36 models, consuming approximately 27 GPU-hours. For the Causal3DIdent dataset, we trained 42 models, which required roughly 25 GPU-hours across 4 GPUs. Additionally, we generated 100,000 synthetic images for the Causal3DIdent dataset using Blender. Rendering was performed over four days on a separate workstation equipped with an AMD Ryzen 7 7700X 8-core processor (4.50 GHz) and a single NVIDIA RTX 4090 GPU (24 GB). |
| Software Dependencies | Yes | H Computation Resources All experiments were conducted on a high-performance computing cluster equipped with 4 NVIDIA A100 GPUs (40 GB each), running CUDA 12.2 and driver version 535.161.07. |
| Experiment Setup | Yes | D.1 Detailed Experimental Setup Parameter settings. We parameterize the generative functions gx and gt(θ) using randomly initialized 3-layer invertible MLPs, following prior work [13]. Invertibility is enforced by maintaining a condition number threshold of 1e-3 for each layer. The encoding functions fx and ft are implemented as 7-layer MLPs and optimized using the Adam optimizer. For MMCL training, we use a batch size of 6144, a learning rate of 1e-4, and train for 100,000 iterations. The loss function is given by Eq. (2), with Euclidean distance used as the similarity metric and a temperature parameter set to 1.0. To ensure training stability, gradients are clipped using a maximum 2-norm of 2. |
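
The Experiment Setup row quotes concrete training settings, so a small sketch may help readers see how they fit together. The Python below is illustrative only and is not the authors' code: Eq. (2) is not reproduced on this page, so the symmetric InfoNCE form is an assumption, and the names `MMCLConfig`, `symmetric_loss`, and `clip_by_global_norm` are hypothetical. Only the numeric values (batch size, learning rate, iterations, temperature, clipping norm) and the use of Euclidean distance as the similarity come from the quoted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MMCLConfig:
    # Values quoted in the Experiment Setup row (numerical simulations).
    batch_size: int = 6144
    learning_rate: float = 1e-4
    iterations: int = 100_000
    temperature: float = 1.0
    max_grad_norm: float = 2.0


def neg_euclidean(u, v):
    """Similarity as negative Euclidean distance (larger means closer)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def info_nce(anchors, candidates, temperature):
    """Mean cross-entropy of matching anchors[i] to candidates[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [neg_euclidean(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def symmetric_loss(zx, zt, temperature=1.0):
    """ASSUMED symmetric form: image-to-text plus text-to-image terms."""
    return info_nce(zx, zt, temperature) + info_nce(zt, zx, temperature)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient list so its 2-norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

With correctly paired embeddings the loss should be lower than with shuffled pairs, and a gradient of norm 5 clipped with `max_grad_norm=2.0` is rescaled to norm 2. A real run would use a tensor library and the Adam optimizer mentioned in the row; plain lists are used here only to keep the sketch dependency-free.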