Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MuSLR: Multimodal Symbolic Logical Reasoning
Authors: Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 7 state-of-the-art VLMs on our benchmark and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR. |
| Researcher Affiliation | Academia | National University of Singapore; Stanford University; Peking University; Uni Melb; University of Auckland; MBZUAI; University of California, Santa Barbara |
| Pseudocode | Yes | Figure 4: LogiCAM Workflow. The figure illustrates a single iteration; the complete multi-iteration reasoning process is detailed in Section 9. Section 5, "LogiCAM: A Modular MuSLR Framework": We propose a modular framework, LogiCAM (Logical reasoning with Commonsense Augmentation with Multimodality), which consists of three modules based on GPT-4.1, as illustrated in Figure 4. Each module is designed to address a specific challenge posed by MuSLR. The modules work together to solve different problem components, which include: (1) the Premise Selector, (2) the Reasoning Type Identifier, and (3) the Reasoner module. Below, we explain how each module addresses its challenge and contributes to the reasoning chain. |
| Open Source Code | Yes | All data and code are publicly available at https://llm-symbol.github.io/MuSLR. |
| Open Datasets | Yes | All data and code are publicly available at https://llm-symbol.github.io/MuSLR. Section 4, "MuSLR-Bench: A Benchmark for Multimodal Symbolic Logical Reasoning", Dataset Construction: We collect images from various sources such as COCO [14], Flickr30k [25], nocaps [1], Mimic [10], RVL_CDIP [8], ScienceQA [17], and manually collected Traffic Reports. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, their percentages, or sample counts. It describes the dataset size as 1093 instances and mentions evaluating models but does not detail how these instances are partitioned for training, validation, and testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments in the main text or the experiment settings section. |
| Software Dependencies | No | The paper mentions using models like GPT-4o, Qwen2.5-VL-7B-Instruct, Llava-1.5-7B, InternVL3-8B, Instructblip-Vicuna-13B, GPT-4.1, and Claude-3.7-Sonnet. It also references Vera [16] (a T5 model). However, it does not provide specific version numbers for underlying software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | Settings. To ensure reproducibility, all models are evaluated under standardized settings. We adopt a three-shot Chain-of-Thought (CoT) [32] prompting setup. For language model sampling, the temperature is set to 0.0 to minimize randomness and encourage deterministic outputs. |
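
The Experiment Setup row describes three-shot Chain-of-Thought prompting with the sampling temperature set to 0.0. The following is a minimal sketch of how such a prompt and sampling configuration could be assembled. It is not the authors' code: the `Demo` class, the `build_cot_prompt` function, the `Question:/Reasoning:/Answer:` template, and the example demonstrations are all hypothetical, and the paper's actual prompt wording is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One worked demonstration (hypothetical structure, not from the paper)."""
    question: str
    reasoning: str
    answer: str


def build_cot_prompt(demos, question):
    """Join exactly three worked demonstrations and the new question into one prompt."""
    if len(demos) != 3:
        raise ValueError("a three-shot setup expects exactly 3 demonstrations")
    parts = [
        f"Question: {d.question}\nReasoning: {d.reasoning}\nAnswer: {d.answer}"
        for d in demos
    ]
    # The final block leaves the reasoning open for the model to complete.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


# Sampling settings as stated in the Experiment Setup row; the dictionary
# layout itself is an assumption and is not tied to any particular API.
SAMPLING = {"temperature": 0.0}

# Placeholder demonstrations, invented purely for illustration.
demos = [
    Demo("If A implies B and A holds, does B hold?", "Apply modus ponens.", "Yes"),
    Demo("If A implies B and B is false, does A hold?", "Apply modus tollens.", "No"),
    Demo("If A or B holds and A is false, does B hold?", "Apply disjunctive syllogism.", "Yes"),
]
prompt = build_cot_prompt(demos, "If all X are Y and z is X, is z Y?")
```

The resulting `prompt` string would be sent to each evaluated model together with `SAMPLING`. How the image is attached to the request depends on the model's interface and is left out of this sketch.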