Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards the Causal Complete Cause of Multi-Modal Representation Learning

Authors: Jingyao Wang, Siyu Zhao, Wenwen Qiang, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct extensive experiments on various benchmark datasets to verify the effectiveness of C3R. More details and experiments are provided in Appendix E-H.
Researcher Affiliation | Academia | ¹Institute of Software, Chinese Academy of Sciences, Beijing, China; ²University of the Chinese Academy of Sciences, Beijing, China; ³Tsinghua University, Beijing, China; ⁴Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China.
Pseudocode | Yes | The pseudo-code of incorporating C3R into the MML models is shown in Algorithm 1.
Open Source Code | Yes | Codes can be found in our GitHub repository: https://github.com/WangJingyao07/Multi-Modal-Base
Open Datasets | Yes | We select six datasets: (i) scene recognition on NYU Depth V2 (Silberman et al., 2012) and SUN RGBD (Song et al., 2015) with RGB and depth images; (ii) image-text classification on UPMC FOOD101 (Wang et al., 2015) and MVSA (Niu et al., 2016) with image and text; (iii) segmentation considering missing modalities on BraTS (Menze et al., 2014; Bakas et al., 2018) with Flair, T1, T1c, and T2; and (iv) synthetic MMLSyn Data (see Appendix D.5).
Dataset Splits | Yes | For the training phase, we generate 1,000 multi-modal samples for training, while generating 200 samples for evaluation.
Hardware Specification | Yes | All experimental procedures are executed in five runs via NVIDIA RTX A6000 GPUs.
Software Dependencies | No | For the basic MML model, we follow the commonly used structure mentioned in (Zhang et al., 2023d; Xu et al., 2023) or the corresponding official code. For the model architecture of the causal representation learner, we use a three-layer Multilayer Perceptron (MLP) neural network with activation functions designed following (Clevert et al., 2016). The dimensions of the hidden vectors of each layer are specified as 64, 32, and 128. It can be embedded after the feature extractor of any MML model, ensuring the causal completeness of the learned representations by learning a learnable matrix based on Eq. 10 that is consistent with the size of the representations. Moving on to the optimization process, we employ the Adam optimizer to train our model.
Experiment Setup | Yes | The hidden vector dimensions of each layer are specified as 64, 32, and 128, while the learned representation dimension is 64. For optimization, we employ the Adam optimizer (Kingma & Ba, 2015) with momentum and weight decay set at 0.8 and 10⁻⁴, respectively. The initial learning rate is established at 0.1, with the flexibility for linear scaling as required. Additionally, we use grid search to set the hyperparameters λv = 0.75 and λfe = 0.4.
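The architecture quoted above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden sizes (64, 32, 128), the ELU-style activation (Clevert et al., 2016), and the 64-dim learned representation come from the report, while the class name, the input dimension of 128, the final projection to the representation size, and the He-style initialization are assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation (Clevert et al., 2016), as referenced in the report."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

class CausalRepresentationLearner:
    """Sketch of the three-layer MLP representation learner.

    Hidden sizes (64, 32, 128) and the 64-dim output follow the report;
    the input dimension, class name, final projection layer, and
    initialization are hypothetical placeholders.
    """
    def __init__(self, in_dim=128, hidden=(64, 32, 128), out_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        dims = (in_dim, *hidden, out_dim)
        # One weight matrix and bias per consecutive pair of dimensions.
        self.weights = [rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
                        for d_in, d_out in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(d_out) for d_out in dims[1:]]

    def forward(self, x):
        # Apply each affine layer followed by the ELU nonlinearity.
        for W, b in zip(self.weights, self.biases):
            x = elu(x @ W + b)
        return x

learner = CausalRepresentationLearner()
features = np.ones((4, 128))   # stand-in for backbone features of 4 samples
reps = learner.forward(features)
print(reps.shape)  # (4, 64)
```

Per the report, this module would be inserted after the feature extractor of an MML backbone and trained with Adam (initial learning rate 0.1, momentum 0.8, weight decay 10⁻⁴); the training loop itself is omitted here.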