Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Revealing Multimodal Causality with Large Language Models
Authors: Jin Li, Shoujin Wang, Qi Zhang, Feng Liu, Tongliang Liu, Longbing Cao, Shui Yu, Fang Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data. The implementation code and data are available at https://github.com/JinLi-i/MLLM-CD. We evaluate both the causal factor and structure discovery performance of MLLM-CD on both synthetic and real-world multimodal datasets, based on state-of-the-art multimodal LLMs including GPT-4o [86], Gemini 2.0 [87], LLaMA 4 Maverick [88], and Grok-2v [89]. |
| Researcher Affiliation | Academia | Jin Li¹, Shoujin Wang¹, Qi Zhang², Feng Liu³, Tongliang Liu⁴, Longbing Cao⁵, Shui Yu¹, Fang Chen¹. ¹University of Technology Sydney, ²Tongji University, ³University of Melbourne, ⁴University of Sydney, ⁵Macquarie University. |
| Pseudocode | Yes | C Algorithm: Algorithm 1 (The MLLM-CD Framework) |
| Open Source Code | Yes | The implementation code and data are available at https://github.com/JinLi-i/MLLM-CD. |
| Open Datasets | Yes | We construct two multimodal datasets for evaluation: (1) Multimodal Apple Gastronome (MAG) dataset, which is a synthetic dataset with 200 samples with 9 high-level factors, and (2) Lung Cancer dataset, which is a real-world dataset with 60 samples with 5 high-level factors. Please refer to Appendix D.3 for more details on the experimental settings and environment information. This dataset will be open-sourced under CC-BY 4.0. Lung Cancer: This is a real-world dataset collected from the MedPix database (footnote 5) under Open Database License. We select 60 representative lung cancer cases (e.g., Non-Small Cell Lung Cancer [91]). |
| Dataset Splits | No | The paper mentions using "200 samples" for MAG dataset and "60 samples" for Lung Cancer dataset, but does not specify how these samples are split into training, validation, or test sets for the experiments. |
| Hardware Specification | Yes | all experiments are conducted on a server with two Intel Xeon 6346 CPUs, 256GB RAM, and two NVIDIA A40 GPUs. |
| Software Dependencies | No | We use the FCI implementation from the causal-learn library [12], available at the website (footnote 6). (It mentions the library but not its specific version number.) |
| Experiment Setup | Yes | In the contrastive factor discovery module, we use the pretrained CLIP model [75] with the ViT-B/32 checkpoint from OpenAI's official release to extract textual and visual embeddings from the multimodal samples in the MAG and Lung Cancer datasets. For intra- and inter-modal contrastive exploration, we choose the top K = 5 pairs of samples with the prompts in Section D.2 for factor identification and annotation. In the causal structure discovery module, we adopt the FCI algorithm [31] to infer the causal structure from the annotated factors. Additional discussion on different CD methods can be found in Section D.7. We use the FCI implementation from the causal-learn library [12], available at the website (footnote 6). In the multimodal counterfactual reasoning module, we set the threshold parameters as τ_sem = 0.7 and τ_causal = 0.4 for consistency validation. ϵ is a small constant set to 10⁻⁶ to highlight any changes in the non-descendant nodes. Following [20], the maximum number of iterations is set to T = 3. Further analysis of parameter choices is discussed in Section D.8. |
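
The settings quoted in the Experiment Setup row can be gathered into one configuration object, shown below as a minimal sketch rather than the authors' implementation. The class `ExperimentSetup`, its field names, and the helpers `cosine` and `top_k_pairs` are hypothetical; only the values (ViT-B/32, K = 5, FCI, τ_sem = 0.7, τ_causal = 0.4, ϵ = 10⁻⁶, T = 3) come from the row. The row does not say how the top K pairs are ranked, so ranking by cosine similarity between embeddings is an assumption, and the direction is left as a flag.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    clip_checkpoint: str = "ViT-B/32"  # CLIP checkpoint named in the row
    top_k_pairs: int = 5               # K, sample pairs per contrastive exploration
    cd_algorithm: str = "FCI"          # causal discovery method (causal-learn library)
    tau_sem: float = 0.7               # semantic consistency threshold
    tau_causal: float = 0.4            # causal consistency threshold
    epsilon: float = 1e-6              # small constant for non-descendant changes
    max_iterations: int = 3            # T


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_pairs(embeddings, k, most_similar=True):
    """Return index pairs (i, j), i < j, ranked by cosine similarity.

    ASSUMPTION: the ranking criterion is not given in the row; this sketch
    ranks by similarity and lets the caller choose the direction.
    """
    scored = [
        (cosine(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=most_similar)
    return [(i, j) for _, i, j in scored[:k]]
```

With real data, `embeddings` would hold the CLIP text or image vectors for each sample, and `top_k_pairs(embeddings, ExperimentSetup().top_k_pairs)` would return five index pairs to pass on to the prompting step.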