Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Large Language Models as End-to-end Combinatorial Optimization Solvers

Authors: Xia Jiang, Yaoxin Wu, Minshuo Li, Zhiguang Cao, Yingqian Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluation across seven NP-hard CO problems shows that our method achieves a high feasibility rate and reduces the average optimality gap to 1.03–8.20% by tuning a 7B-parameter LLM, surpassing both general-purpose LLMs (e.g., GPT-4o), reasoning models (e.g., DeepSeek-R1), and domain-specific heuristics.
Researcher Affiliation	Academia	Xia Jiang Eindhoven University of Technology EMAIL Yaoxin Wu Eindhoven University of Technology EMAIL Minshuo Li Eindhoven University of Technology EMAIL Zhiguang Cao Singapore Management University EMAIL Yingqian Zhang Eindhoven University of Technology EMAIL
Pseudocode	No	The paper describes methods like Supervised Fine-tuning (SFT) and Feasibility-and-optimality-aware Reinforcement Learning (FOARL) using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 4.3 describes the FOARL algorithm but not in pseudocode format.
Open Source Code	Yes	More details of the baselines are elaborated in Appendix D. Our code and data are publicly available at https://github.com/Summer142857/LLMCoSolver.
Open Datasets	Yes	To further demonstrate the generalizability, we evaluate the fine-tuned JSSP solver on Taillard (TA) benchmark [65]. ... We also compare our method with other LLM-based methods on TSPLib, and the result is presented in Appendix E.11.
Dataset Splits	Yes	For SFT, we generate 500,000 instances per CO problem, and an additional set with at most 3,200 instances is used for FOARL. ... During evaluation, we use 100 randomly generated instances (following the instance generation process specified in Appendix A) for each COP.
Hardware Specification	Yes	All experiments are conducted on a server equipped with an AMD EPYC 7F72 CPU (3.2 GHz) and an NVIDIA H100 GPU.
Software Dependencies	No	Appendix G lists several software resources used (OR-Tools, LKH-3, Gurobi, pyCombinatorial, Compass, Unsloth, DeepACO, OPRO, SGE, ReEvo, MCTS-AHD) but does not provide specific version numbers for these software components. For example, it mentions 'OR-Tools Code https://github.com/google/or-tools Apache-2.0 License' without a version.
Experiment Setup	Yes	The LLMs are fine-tuned with a context length of 20,000 tokens. Both the LoRA rank and scaling factor are set to 64 for parameter-efficient fine-tuning. For the SFT process, we use a batch size of 4, with a gradient accumulation step of 4, resulting in an effective batch size of 16. Optimization is performed using the AdamW optimizer, with a learning rate of 2 × 10⁻⁴, and a linear decay scheduler with a decay rate of 0.01. For the FOARL process, we set the hyperparameters ϵ = 0.1 and β = 0.05. The batch size is set to 8, while S = 8 generations are produced for each instance to calculate the group advantage A_i. We set the weighting parameter α = 1 to balance the optimality and feasibility of the generated solutions. The learning rate for reinforcement learning is set to 1 × 10⁻⁶, and the rest of the parameters are the same as the SFT process.
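
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which also makes the effective batch size arithmetic (4 × 4 = 16) explicit. This is only an illustration: the class names, field names, and the `group_advantages` helper are hypothetical and do not come from the authors' repository. The advantage computation assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation), which the quoted text does not specify.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class SFTConfig:
    """SFT values as quoted in the table; field names are hypothetical."""
    context_length: int = 20_000
    lora_rank: int = 64
    lora_scaling: int = 64          # "scaling factor" in the quoted text
    batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-4
    decay_rate: float = 0.01        # quoted as the scheduler's "decay rate"

    @property
    def effective_batch_size(self) -> int:
        # Gradients are accumulated over several steps before each update.
        return self.batch_size * self.gradient_accumulation_steps


@dataclass(frozen=True)
class FOARLConfig:
    """FOARL values as quoted in the table; field names are hypothetical."""
    epsilon: float = 0.1
    beta: float = 0.05
    batch_size: int = 8
    num_generations: int = 8        # S in the quoted text
    alpha: float = 1.0              # balances optimality and feasibility
    learning_rate: float = 1e-6


def group_advantages(rewards):
    """Assumed GRPO-style advantage: normalize rewards within one group.

    The quoted text only says a group advantage A_i is computed from the
    S generations; this normalization is an assumption, not the paper's
    confirmed formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All generations scored equally, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    sft = SFTConfig()
    foarl = FOARLConfig()
    print("effective SFT batch size:", sft.effective_batch_size)
    # Hypothetical rewards for S generations of a single instance.
    rewards = [0.2, 0.5, 0.5, 0.9, 0.1, 0.7, 0.3, 0.8]
    assert len(rewards) == foarl.num_generations
    print("group advantages:", group_advantages(rewards))
```

Keeping the two stages in separate frozen dataclasses mirrors the row's statement that FOARL reuses the SFT settings except for the values it lists explicitly.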