Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards the Causal Complete Cause of Multi-Modal Representation Learning
Authors: Jingyao Wang, Siyu Zhao, Wenwen Qiang, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive experiments on various benchmark datasets to verify the effectiveness of C3R. More details and experiments are provided in Appendix E-H. |
| Researcher Affiliation | Academia | 1Institute of Software, Chinese Academy of Sciences, Beijing, China; 2University of the Chinese Academy of Sciences, Beijing, China; 3Tsinghua University, Beijing, China; 4Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China. |
| Pseudocode | Yes | The pseudo-code of incorporating C3R into the MML models is shown in Algorithm 1. |
| Open Source Code | Yes | Codes can be found in our Github repository: https://github.com/WangJingyao07/Multi-Modal-Base |
| Open Datasets | Yes | We select six datasets: (i) scenes recognition on NYU Depth V2 (Silberman et al., 2012) and SUN RGBD (Song et al., 2015) with RGB and depth images; (ii) image-text classification on UPMC FOOD101 (Wang et al., 2015) and MVSA (Niu et al., 2016) with image and text; (iii) segmentation considering missing modalities on BraTS (Menze et al., 2014; Bakas et al., 2018) with Flair, T1, T1c, and T2; and (iv) synthetic MMLSyn Data (see Appendix D.5). |
| Dataset Splits | Yes | For the training phase, we generate 1,000 multi-modal samples for training, while generating 200 samples for evaluation. |
| Hardware Specification | Yes | All experimental procedures are executed in five runs via NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | For the basic MML model, we follow the commonly used structure mentioned in (Zhang et al., 2023d; Xu et al., 2023) or the corresponding official code. For the model architecture of the causal representations learner, we use a three-layer Multilayer Perceptron (MLP) neural network with activation functions designed following (Clevert et al., 2016). The dimensions of the hidden vectors of each layer are specified as 64, 32, and 128. It can be embedded after the feature extractor of any MML model, ensuring the causal completeness of the learned representations by learning a learnable matrix based on Eq.10 that is consistent with the size of the representations. Moving on to the optimization process, we employ the Adam optimizer to train our model. |
| Experiment Setup | Yes | The hidden vector dimensions of each layer are specified as 64, 32, and 128, while the learned representation is 64. For optimization, we employ the Adam optimizer (Kingma & Ba, 2015) with momentum and weight decay set at 0.8 and 10^-4. The initial learning rate is established at 0.1, with the flexibility for linear scaling as required. Additionally, we use grid search to set the hyperparameters λv = 0.75 and λfe = 0.4. |
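The quoted setup can be sketched in code. The following is a minimal NumPy sketch of the causal representation learner described above: a three-layer MLP with hidden sizes 64, 32, and 128 and ELU-style activations (Clevert et al., 2016), projecting to a 64-dimensional representation. The layer sizes come from the quoted text; the input dimension, initialization scale, and the final linear projection to 64 dimensions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def init_mlp(input_dim, hidden_dims=(64, 32, 128), rep_dim=64, seed=0):
    """Initialize weights for the sketched learner.

    Hidden sizes (64, 32, 128) and rep_dim=64 follow the quoted setup;
    input_dim and the init scale are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden_dims, rep_dim]
    return [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ELU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = elu(x)
    return x

# A batch of 4 extracted features of assumed dimension 128 maps to
# 4 causal representations of dimension 64.
params = init_mlp(input_dim=128)
reps = forward(params, np.zeros((4, 128)))
print(reps.shape)  # (4, 64)
```

In the paper's pipeline this module is embedded after the feature extractor of an MML model and trained with Adam; the quoted momentum 0.8 and weight decay 10^-4 would be passed to the optimizer rather than baked into the network itself.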