Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

Authors: Fan LIU, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, Hao Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot.
Researcher Affiliation	Academia	Fan Liu¹, Zherui Yang¹, Cancheng Liu¹, Tianrui Song¹, Xiaofeng Gao², Hao Liu¹. ¹AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); ²Department of Computer Science and Engineering, Shanghai Jiao Tong University.
Pseudocode	No	The paper describes workflows and processes but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/usail-hkust/LLM-MM-Agent
Open Datasets	Yes	To enable systematic evaluation, we introduce MM-Bench, a new benchmark constructed from 111 real-world problems adapted from MCM/ICM, spanning the years 2000 to 2025. MM-Bench covers ten application domains (e.g., physics, biology, and economics) and eight modeling task types (e.g., decision-making, prediction, and evaluation). Each sample includes rich contextual components (e.g., textual descriptions, task goals, dataset information, and variable definitions) and requires agents to conduct problem interpretation, model formulation, and numerical reasoning in an integrated, end-to-end fashion. A detailed breakdown of task types and domain distribution is provided in Appendix B.
Dataset Splits	Yes	We select a subset of mathematical modeling problems from the past five years (2021–2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021–2024 separately from those in 2025.
Hardware Specification	No	The paper mentions using GPT-4o and Deepseek-R1 as base models and states that "All evaluations are conducted via official APIs provided by model vendors." This indicates that the experiments were run on external services, and the specific hardware used by these services is not detailed in the paper.
Software Dependencies	No	The paper mentions using GPT-4o and Deepseek-R1 as base models, but it does not specify versions for other software libraries, frameworks, or tools used in their own implementation, apart from referring to the MLE-Solver without a version number.
Experiment Setup	Yes	We select a subset of mathematical modeling problems from the past five years (2021–2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021–2024 separately from those in 2025. The LLM agents used in this evaluation include GPT-4o and Deepseek-R1 as base models. For the evaluation, we adopt both GPT-4o-based automatic scoring and human expert review, using a unified 1-to-10 scale. The selected human experts have previously earned at least an Honorable Mention in mathematical modeling competitions. Additional experimental details are provided in Appendix D.
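
The year-based separation described in the Dataset Splits and Experiment Setup rows can be illustrated with a short sketch. It is not taken from the paper or its repository: the `ScoredProblem` record, the `split_by_cutoff` and `mean_score` helpers, and the identifier format are all hypothetical, and it only assumes that each test problem carries a contest year and a score on the 1-to-10 scale.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ScoredProblem:
    """Hypothetical record for one evaluated problem (not the paper's schema)."""
    problem_id: str  # assumed identifier format, e.g. "2023_A"
    year: int        # contest year of the problem
    score: float     # rating on the 1-to-10 scale


def split_by_cutoff(
    problems: List[ScoredProblem], cutoff_year: int = 2025
) -> Tuple[List[ScoredProblem], List[ScoredProblem]]:
    """Separate problems released before the cutoff year from those at or after it.

    Reporting the two groups separately is one way to see whether scores on
    older problems might be inflated by pretraining data leakage.
    """
    earlier = [p for p in problems if p.year < cutoff_year]
    recent = [p for p in problems if p.year >= cutoff_year]
    return earlier, recent


def mean_score(problems: List[ScoredProblem]) -> Optional[float]:
    """Average score of a group, or None when the group is empty."""
    if not problems:
        return None
    return mean(p.score for p in problems)
```

With such records, `split_by_cutoff(problems)` returns the 2021–2024 group and the 2025 group, and `mean_score` can be applied to each so the two averages are reported side by side rather than pooled.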