Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search

Authors: Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, Haifeng Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.
Researcher Affiliation	Collaboration	1: Baylor University; 2: NEC Labs America; 3: Augusta University; 4: Southern Illinois University; 5: University of Texas at Dallas
Pseudocode	No	The paper describes the MCTS algorithm steps (selection, dynamic expansion, simulation, and backpropagation) in sections 3.2.2-3.2.5 and illustrates them in Figure 2, but it does not provide a formal pseudocode or algorithm block.
Open Source Code	Yes	Answer: [Yes] Justification: The code is attached, and we are committed to releasing it publicly upon acceptance.
Open Datasets	Yes	For evaluation, we use the test set portions from six real-world optimization and operation task datasets: NL4Opt [19], Mamo (Easy LP and Complex LP) [10], NLP4LP [2], Complex OR [23], and Industry OR [22].
Dataset Splits	Yes	For evaluation, we use the test set portions from six real-world optimization and operation task datasets: NL4Opt [19], Mamo (Easy LP and Complex LP) [10], NLP4LP [2], Complex OR [23], and Industry OR [22]. These datasets include optimization problem cases of varying difficulty, types, and domains. Among them, the test set of NLP4LP is obtained by shuffling the source dataset and randomly sampling 100 cases. All other datasets use the same setting as LLMOPT [12].
Hardware Specification	No	This paper mainly relies on server APIs rather than local computation, so we do not report any compute resources.
Software Dependencies	No	The paper mentions using 'standard optimization solvers such as Gurobi or Pyomo' and generating Pyomo code, but it does not specify version numbers for Pyomo or Gurobi. It also lists various LLMs (GPT-4, GPT-4o, Mistral-7B, Deepseek Math-7B-Base, LLaMa3-8B, Qwen1.5-14B) but these are referred to by name rather than specific software versions.
Experiment Setup	Yes	The final hyperparameter settings are shown in Table 6. Table 6: Hyperparameter configuration. Maximum number of components per expansion: 3; Maximum number of nodes per layer: 5; Maximum number of search iterations: 20; Exploration weight of UCT c: 2; Reward weight α, β, γ: 0.1, 0.8, 0.1; Local uncertainty threshold η: 0.3; LLM temperature: 0.2.
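
For readers who want to reuse the reported settings, the hyperparameters in the Experiment Setup row can be collected into a single configuration object. The sketch below is illustrative only: the class name, field names, and the `uct_score` helper are hypothetical and do not come from the paper or its code. The scoring function assumes the standard UCT form (mean reward plus an exploration bonus scaled by c), which may differ from the variant SolverLLM actually uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Values copied from Table 6 as quoted above; field names are hypothetical."""
    max_components_per_expansion: int = 3
    max_nodes_per_layer: int = 5
    max_search_iterations: int = 20
    uct_exploration_weight: float = 2.0                    # c
    reward_weights: tuple = (0.1, 0.8, 0.1)                # alpha, beta, gamma
    local_uncertainty_threshold: float = 0.3               # eta
    llm_temperature: float = 0.2


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float) -> float:
    """Standard UCT score (an assumption, not the paper's confirmed formula)."""
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


if __name__ == "__main__":
    cfg = SearchConfig()
    # Score a child visited 3 times (total reward 2.4) under a parent visited 10 times.
    print(uct_score(2.4, 3, 10, cfg.uct_exploration_weight))
```

Freezing the dataclass keeps the settings immutable during a run, which makes it easier to log exactly which configuration produced a given result.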