Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances... Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.
Researcher Affiliation | Collaboration | Rui Xu (1,2,3), Dakuan Lu (3), Zicheng Zhao (1), Xiaoyu Tan (3), Xintao Wang (1), Siyu Yuan (1), Jiangjie Chen (1), Yinghui Xu (1,3); (1) Fudan University, (2) SII, (3) INF Technology
Pseudocode | No | The paper describes evaluation processes and system logic in text and functional descriptions (e.g., in Appendix D: Crease Pattern evaluation system), but it does not present these as structured pseudocode or algorithm blocks with formal labels like "Algorithm".
Open Source Code | Yes | We provide complete data, evaluation code, and model training code, which can be accessed via GitHub in the public version.
Open Datasets | Yes | This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate... The dataset contains 350 data instances... All our data are public data or authorized by the original websites and data sources, with no potential infringement risks.
Dataset Splits | No | We collect 350 sets of origami data... In addition to this part of the data, we also collect 471 groups of data without intermediate folding processes for the subsequent training of the model.
Hardware Specification | Yes | Specifically, we trained for 10.2 hours on 16 H100 GPUs, with the following hyperparameter settings: γ_turn = 0.95, γ_token = 1.0, KL penalty = 0.001, Actor LR = 1 × 10⁻⁶, and Critic LR = 1 × 10⁻⁵.
Software Dependencies | No | For the reinforcement learning method, we adopt TRICO [35] for training on qwen2.5-vl-32B, which is a PPO-based [36], more efficient MLLMs multi-turn reinforcement learning algorithm.
Experiment Setup | Yes | Specifically, we trained for 10.2 hours on 16 H100 GPUs, with the following hyperparameter settings: γ_turn = 0.95, γ_token = 1.0, KL penalty = 0.001, Actor LR = 1 × 10⁻⁶, and Critic LR = 1 × 10⁻⁵.
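
For readers who want the settings quoted in the Hardware Specification and Experiment Setup rows in machine-readable form, the following is a minimal sketch that collects them in a Python dataclass and derives the total compute budget. The class and field names are hypothetical and do not come from the paper or from TRICO, and the two learning rates are read here as 1e-6 and 1e-5 on the assumption that the exponents were lost in extraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Settings quoted in the table above; all field names are hypothetical."""

    gamma_turn: float = 0.95        # turn-level discount factor
    gamma_token: float = 1.0        # token-level discount factor
    kl_penalty: float = 0.001       # KL penalty coefficient
    actor_lr: float = 1e-6          # assumed reading of the garbled "1 10 6"
    critic_lr: float = 1e-5         # assumed reading of the garbled "1 10 5"
    num_gpus: int = 16              # H100 GPUs
    wall_clock_hours: float = 10.2  # reported training time

    def gpu_hours(self) -> float:
        """Total compute as GPUs multiplied by wall-clock hours."""
        return self.num_gpus * self.wall_clock_hours


setup = ReportedTrainingSetup()
print(f"{setup.gpu_hours():.1f} GPU-hours")  # prints "163.2 GPU-hours"
```

Under these assumptions the reported run corresponds to roughly 163 H100 GPU-hours, which may help when budgeting a reproduction attempt.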