Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling

Authors: Haoyang Liu, Jie Wang, Yuyang Cai, Xiongwei Han, Yufei Kuang, Jianye Hao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that OptiTree significantly improves the modeling accuracy compared to the state-of-the-art, achieving over 10% improvements on the challenging benchmarks. We consider seven modeling datasets to evaluate our method and the baselines. We evaluate OptiTree against the competitive baselines on five modeling datasets using the DeepSeek V3 and GPT-4o models. The results, presented in Table 1, highlight three key findings.
Researcher Affiliation	Collaboration	Haoyang Liu1, Jie Wang1, Yuyang Cai1, Xiongwei Han2, Yufei Kuang1, Jianye Hao2,3; 1 MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2 Noah's Ark Lab, Huawei Technologies; 3 Tianjin University
Pseudocode	Yes	Appendix D: Algorithm of OptiTree. Algorithm 1: Tree Search for Subproblem Decomposition. Algorithm 2: Modeling Tree Update.
Open Source Code	Yes	The code is released at https://github.com/MIRALab-USTC/OptiTree/tree/main.
Open Datasets	Yes	Dataset. We consider seven modeling datasets to evaluate our method and the baselines. (1) NL4Opt [32] is from the NL4Opt competition in NeurIPS 2022, comprising 289 elementary-level linear programming problems. (2) MAMO EasyLP [20] contains 652 easy linear programming problems. (3) MAMO ComplexLP [20] consists of 211 more complex optimization problems. (4) ComplexOR [39] has 19 challenging problems derived from academic papers, textbooks, and real-world industry scenarios. (5) IndustryOR [17] contains 100 real-world problems from eight industries with different difficulty levels: Easy, Medium and Hard. Finally, (6) OptiBench [41] contains 605 problems, and (7) the OptMATH [25] dataset has 166 challenging problems.
Dataset Splits	No	Dataset. We consider seven modeling datasets to evaluate our method and the baselines. [...] We construct the modeling tree using 400 randomly selected problems in the OR-Instruct dataset [17], which is part of the training dataset for ORLM and contains 3,000 problems. The paper does not specify the train/test/validation splits for the main evaluation datasets used across the benchmarks (e.g., NL4Opt, MAMO, ComplexOR, IndustryOR, OptiBench, OptMATH).
Hardware Specification	No	The paper discusses the time cost of the searching process and the execution time for the solver code, but it does not provide specific details about the CPU, GPU, memory, or other hardware used to run the experiments.
Software Dependencies	No	The Gurobi code examples in Appendix J.1 and other sections include 'import gurobipy as gp' and 'from gurobipy import GRB', but no specific version number for Gurobi or any other software dependency is provided.
Experiment Setup	Yes	We construct the modeling tree using 400 randomly selected problems in the OR-Instruct dataset [17], which is part of the training dataset for ORLM and contains 3,000 problems. We evaluate OptiTree against the competitive baselines on five modeling datasets using the DeepSeek V3 and GPT-4o models. [...] We implement two variants, OptiTree (depth=1) and OptiTree (depth=3), which limit the tree search depths to 1 or 3, respectively.