Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem
Authors: Fan LIU, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, Hao Liu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot. |
| Researcher Affiliation | Academia | Fan Liu¹, Zherui Yang¹, Cancheng Liu¹, Tianrui Song¹, Xiaofeng Gao², Hao Liu¹. ¹AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); ²Department of Computer Science and Engineering, Shanghai Jiao Tong University. |
| Pseudocode | No | The paper describes workflows and processes but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/usail-hkust/LLM-MM-Agent |
| Open Datasets | Yes | To enable systematic evaluation, we introduce MM-Bench, a new benchmark constructed from 111 real-world problems adapted from MCM/ICM, spanning the years 2000 to 2025. MM-Bench covers ten application domains (e.g., physics, biology, and economics) and eight modeling task types (e.g., decision-making, prediction, and evaluation). Each sample includes rich contextual components (e.g., textual descriptions, task goals, dataset information, and variable definitions) and requires agents to conduct problem interpretation, model formulation, and numerical reasoning in an integrated, end-to-end fashion. A detailed breakdown of task types and domain distribution is provided in Appendix B. |
| Dataset Splits | Yes | We select a subset of mathematical modeling problems from the past five years (2021-2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021-2024 separately from those in 2025. |
| Hardware Specification | No | The paper mentions using GPT-4o and Deepseek-R1 as base models and states that "All evaluations are conducted via official APIs provided by model vendors." This indicates that the experiments were run on external services, and the specific hardware used by these services is not detailed in the paper. |
| Software Dependencies | No | The paper mentions using GPT-4o and Deepseek-R1 as base models, but it does not specify versions for other software libraries, frameworks, or tools used in their own implementation, apart from referring to the MLE-Solver without a version number. |
| Experiment Setup | Yes | We select a subset of mathematical modeling problems from the past five years (2021-2025) as our test set, ensuring diversity across problem types and domains to support a representative evaluation. This subset consists of 32 problems in total. To mitigate potential data leakage from LLM pretraining, we evaluate problems from 2021-2024 separately from those in 2025. The LLM agents used in this evaluation include GPT-4o and Deepseek-R1 as base models. For the evaluation, we adopt both GPT-4o-based automatic scoring and human expert review, using a unified 1-to-10 scale. The selected human experts have previously earned at least an Honorable Mention in mathematical modeling competitions. Additional experimental details are provided in Appendix D. |
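
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It assumes each test problem carries a competition year and that scores are plain numbers on the 1-to-10 scale. The `Problem` class, the function names, and the sample identifiers are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one test problem (not the MM-Bench schema)."""
    problem_id: str
    year: int


def split_by_year(problems, first_year=2021, latest_year=2025):
    """Separate earlier problems from the latest year's problems.

    Mirrors the quoted protocol: problems from 2021-2024 are evaluated
    separately from those in 2025 to limit pretraining leakage.
    """
    earlier = [p for p in problems if first_year <= p.year < latest_year]
    latest = [p for p in problems if p.year == latest_year]
    return earlier, latest


def mean_score(scores):
    """Average scores after checking each lies on the 1-to-10 scale."""
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"score {score} is outside the 1-to-10 scale")
    return mean(scores)


# Made-up example data, for illustration only.
sample = [Problem("2021_A", 2021), Problem("2024_C", 2024), Problem("2025_B", 2025)]
earlier, latest = split_by_year(sample)
```

Reporting the two groups separately lets a reader compare results on problems a model may have seen during pretraining with results on the most recent ones.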