Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MuSLR: Multimodal Symbolic Logical Reasoning

Authors: Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate 7 state-of-the-art VLMs on our benchmark and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
Researcher Affiliation	Academia	National University of Singapore; Stanford University; Peking University; Uni Melb; University of Auckland; MBZUAI; University of California, Santa Barbara
Pseudocode	Yes	Figure 4: LogiCAM Workflow. The figure illustrates a single iteration; the complete multi-iteration reasoning process is detailed in Section 9. Section 5, "LogiCAM: A Modular MuSLR Framework": We propose a modular framework, LogiCAM (Logical reasoning with Commonsense Augmentation with Multimodality), which consists of three modules based on GPT-4.1, as illustrated in Figure 4. Each module is designed to address a specific challenge posed by MuSLR. The modules work together to solve different problem components, which include: (1) the Premise Selector, (2) the Reasoning Type Identifier, and (3) the Reasoner module. Below, we explain how each module addresses its challenge and contributes to the reasoning chain.
Open Source Code	Yes	All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
Open Datasets	Yes	All data and code are publicly available at https://llm-symbol.github.io/MuSLR. Section 4, "MuSLR-Bench: A Benchmark for Multimodal Symbolic Logical Reasoning": Dataset Construction. We collect images from various sources such as COCO [14], Flickr30k [25], nocaps [1], Mimic [10], RVL_CDIP [8], ScienceQA [17], and manually collected Traffic Reports.
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits, their percentages, or sample counts. It describes the dataset size as 1093 instances and mentions evaluating models but does not detail how these instances are partitioned for training, validation, and testing.
Hardware Specification	No	The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments in the main text or the experiment settings section.
Software Dependencies	No	The paper mentions using models like GPT-4o, Qwen2.5-VL-7B-Instruct, Llava-1.5-7B, InternVL3-8B, Instructblip-Vicuna-13B, GPT-4.1, and Claude-3.7-Sonnet. It also references Vera [16] (a T5 model). However, it does not provide specific version numbers for underlying software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup	Yes	Settings. To ensure reproducibility, all models are evaluated under standardized settings. We adopt a three-shot Chain-of-Thought (CoT) [32] prompting setup. For language model sampling, the temperature is set to 0.0 to minimize randomness and encourage deterministic outputs.
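
The Experiment Setup row describes three-shot Chain-of-Thought prompting with the sampling temperature set to 0.0. The sketch below illustrates what such a setup could look like; it is not the authors' code. The `Exemplar` class, the `build_cot_prompt` function, the `DECODING` dictionary, and the "Question / Reasoning / Answer" prompt layout are all hypothetical names and choices made for illustration. Only the shot count (three) and the temperature (0.0) come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example shown to the model (hypothetical structure)."""
    question: str
    reasoning: str
    answer: str


# Hypothetical decoding settings: only the temperature value (0.0) is
# taken from the quoted Experiment Setup text.
DECODING = {"temperature": 0.0}


def build_cot_prompt(exemplars, question, shots=3):
    """Assemble a few-shot Chain-of-Thought prompt as plain text.

    The first `shots` exemplars are rendered with their reasoning and
    answer; the target question is appended with an open "Reasoning:"
    cue so the model continues with its own reasoning chain.
    """
    if len(exemplars) < shots:
        raise ValueError(f"need at least {shots} exemplars, got {len(exemplars)}")
    parts = [
        f"Question: {ex.question}\nReasoning: {ex.reasoning}\nAnswer: {ex.answer}"
        for ex in exemplars[:shots]
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)
```

In use, the returned string would be sent to a model together with the settings in `DECODING`; how the image is attached depends on the model API and is outside this sketch. A temperature of 0.0 reduces sampling randomness, though it does not by itself guarantee identical outputs across runs on every serving stack.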