Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search

Authors: Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, Haifeng Chen

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training. |
| Researcher Affiliation | Collaboration | 1 Baylor University; 2 NEC Labs America; 3 Augusta University; 4 Southern Illinois University; 5 University of Texas at Dallas |
| Pseudocode | No | The paper describes the MCTS algorithm steps (selection, dynamic expansion, simulation, and backpropagation) in Sections 3.2.2-3.2.5 and illustrates them in Figure 2, but it does not provide a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Answer: [Yes] Justification: The code is attached, and we are committed to releasing it publicly upon acceptance. |
| Open Datasets | Yes | For evaluation, we use the test set portions from six real-world optimization and operation task datasets: NL4Opt [19], Mamo (Easy LP and Complex LP) [10], NLP4LP [2], Complex OR [23], and Industry OR [22]. |
| Dataset Splits | Yes | For evaluation, we use the test set portions from six real-world optimization and operation task datasets: NL4Opt [19], Mamo (Easy LP and Complex LP) [10], NLP4LP [2], Complex OR [23], and Industry OR [22]. These datasets include optimization problem cases of varying difficulty, types, and domains. Among them, the test set of NLP4LP is obtained by shuffling the source dataset and randomly sampling 100 cases. All other datasets use the same setting as LLMOPT [12]. |
| Hardware Specification | No | This paper mainly relies on server APIs rather than local computation, so we do not report any compute resources. |
| Software Dependencies | No | The paper mentions using 'standard optimization solvers such as Gurobi or Pyomo' and generating Pyomo code, but it does not specify version numbers for Pyomo or Gurobi. It also lists various LLMs (GPT-4, GPT-4o, Mistral-7B, Deepseek Math-7B-Base, LLaMa3-8B, Qwen1.5-14B), but these are referred to by name rather than by specific software versions. |
| Experiment Setup | Yes | The final hyperparameter settings are shown in Table 6. Table 6: Hyperparameter configuration. Maximum number of components per expansion: 3; maximum number of nodes per layer: 5; maximum number of search iterations: 20; exploration weight of UCT c: 2; reward weights α, β, γ: 0.1, 0.8, 0.1; local uncertainty threshold η: 0.3; LLM temperature: 0.2. |
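
The Table 6 values quoted in the Experiment Setup row can be gathered into a single configuration object, as in the minimal sketch below. The numbers come from the quoted table; the class name, field names, and the `uct_score` helper are hypothetical and are not taken from the paper or its code. The scoring function uses the textbook UCT form (mean reward plus an exploration bonus scaled by c), which is an assumption: this page does not state the exact selection formula the authors use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Hyperparameter values from Table 6; field names are illustrative."""
    max_components_per_expansion: int = 3
    max_nodes_per_layer: int = 5
    max_search_iterations: int = 20
    uct_exploration_weight: float = 2.0          # c
    reward_weights: tuple = (0.1, 0.8, 0.1)      # alpha, beta, gamma
    local_uncertainty_threshold: float = 0.3     # eta
    llm_temperature: float = 0.2


def uct_score(total_reward, visits, parent_visits, c):
    """Textbook UCT score (assumed form, not confirmed by the source)."""
    if visits == 0:
        # Unvisited children get infinite priority so they are tried first.
        return math.inf
    mean_reward = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + exploration


config = SearchConfig()
score = uct_score(
    total_reward=1.6,
    visits=2,
    parent_visits=10,
    c=config.uct_exploration_weight,
)
```

With c = 2, the exploration term is weighted fairly heavily relative to a mean reward bounded by 1, which would favor trying less-visited nodes within the 20-iteration budget under this assumed formula.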