Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

Authors: Fan LIU, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, Hao Liu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot.
Researcher Affiliation | Academia | Fan Liu¹, Zherui Yang¹, Cancheng Liu¹, Tianrui Song¹, Xiaofeng Gao², Hao Liu¹. ¹AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); ²Department of Computer Science and Engineering, Shanghai Jiao Tong University. EMAIL; EMAIL; EMAIL; EMAIL; EMAIL
Pseudocode | No | The paper describes workflows and processes but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/usail-hkust/LLM-MM-Agent
Open Datasets | Yes | To enable systematic evaluation, we introduce MM-Bench, a new benchmark constructed from 111 real-world problems adapted from MCM/ICM, spanning the years 2000 to 2025. MM-Bench covers ten application domains (e.g., physics, biology, and economics) and eight modeling task types (e.g., decision-making, prediction, and evaluation). Each sample includes rich contextual components (e.g., textual descriptions, task goals, dataset information, and variable definitions) and requires agents to conduct problem interpretation, model formulation, and numerical reasoning in an integrated, end-to-end fashion. A detailed breakdown of task types and domain distribution is provided in Appendix B.
Dataset Splits | Yes | We select a subset of mathematical modeling problems from the past five years (2021–2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021–2024 separately from those in 2025.
Hardware Specification | No | The paper mentions using GPT-4o and Deepseek-R1 as base models and states that "All evaluations are conducted via official APIs provided by model vendors." This indicates that the experiments were run on external services, and the specific hardware used by these services is not detailed in the paper.
Software Dependencies | No | The paper mentions using GPT-4o and Deepseek-R1 as base models, but it does not specify versions for other software libraries, frameworks, or tools used in their own implementation, apart from referring to the MLE-Solver without a version number.
Experiment Setup | Yes | We select a subset of mathematical modeling problems from the past five years (2021–2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021–2024 separately from those in 2025. The LLM agents used in this evaluation include GPT-4o and Deepseek-R1 as base models. For the evaluation, we adopt both GPT-4o-based automatic scoring and human expert review, using a unified 1-to-10 scale. The selected human experts have previously earned at least an Honorable Mention in mathematical modeling competitions. Additional experimental details are provided in Appendix D.
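
The leakage-aware split quoted above (problems from 2025 evaluated separately from earlier years, with scores on a 1-to-10 scale) can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the record layout, the helper names `split_by_year` and `mean_score`, and every problem ID and score are hypothetical placeholders rather than values from the paper.

```python
from statistics import mean

# Hypothetical records: (problem_id, year, score on the 1-to-10 scale).
# The IDs and scores are placeholders, NOT results reported in the paper.
RECORDS = [
    ("p1", 2021, 7.0),
    ("p2", 2023, 8.0),
    ("p3", 2024, 6.0),
    ("p4", 2025, 9.0),
    ("p5", 2025, 7.0),
]

# Assumption: problems from this year onward are reported as their own group.
LEAKAGE_CUTOFF_YEAR = 2025


def split_by_year(records, cutoff=LEAKAGE_CUTOFF_YEAR):
    """Partition records into (years before cutoff, cutoff year or later)."""
    earlier = [r for r in records if r[1] < cutoff]
    latest = [r for r in records if r[1] >= cutoff]
    return earlier, latest


def mean_score(records):
    """Average score of a group, or None if the group is empty."""
    return mean(r[2] for r in records) if records else None


earlier, latest = split_by_year(RECORDS)
summary = {"2021-2024": mean_score(earlier), "2025": mean_score(latest)}
print(summary)
```

Reporting the two groups side by side, rather than one pooled average, keeps any gap between possibly pretraining-exposed problems and the newest ones visible.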