Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Authors: Yitian Chen, Jingfan Xia, Siyu Shao, DongDong Ge, Yinyu Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on diverse public benchmarks demonstrate that models trained with our SIRL framework achieve state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Specifically, our SIRL-32B model surpasses DeepSeek-V3 and OpenAI-o3 on the majority of these benchmarks. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.
Researcher Affiliation | Collaboration | Yitian Chen (1), Jingfan Xia (2,1), Siyu Shao (1,3), Dongdong Ge (4), Yinyu Ye (4,5). (1) Cardinal Operations, China; (2) Shanghai University of Finance and Economics; (3) The University of Hong Kong; (4) Antai School of Economics and Management, Shanghai Jiao Tong University; (5) Department of Management Science and Engineering, Stanford University. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the SIRL framework, its components, and training procedures in narrative text. While it references algorithms like REINFORCE++ and discusses structured prompts, it does not contain a dedicated pseudocode block, algorithm box, or a clearly labeled 'Algorithm X' with step-by-step instructions in a code-like format for its own methodology.
Open Source Code | Yes | Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.
Open Datasets | Yes | We evaluated the performance of our trained models on four key optimization modeling datasets: NL4OPT [82], MAMO [83], IndustryOR [25], and OptMATH [26]. ... For the benefit of future research, a comprehensive description detailing our correction methodology, the rationale for each modification, and the revised datasets are publicly available at the GitHub repository https://github.com/Cardinal-Operations/SIRL/blob/main/test_data/README.md.
Dataset Splits | Yes | Starting from the synthetic dataset, we applied a filtering strategy guided by the principle "Less is More" [73, 74]. Specifically, we excluded (question, answer) pairs if the baseline Qwen-32B-Instruct model [22] achieved an 80% success rate (8/10 attempts across different prompting roles) in generating executable code matching the ground-truth optimal value, as such samples were deemed too trivial. This process yielded approximately 70,000 samples. From this set, we then randomly sampled 10,000 instances to form our training data. We evaluated the performance of our trained models on four key optimization modeling datasets: NL4OPT [82], MAMO [83], IndustryOR [25], and OptMATH [26] using pass@1 accuracy.
Hardware Specification | Yes | All experiments for the 7B model were conducted on a single compute node equipped with eight 80GB NVIDIA H100 GPUs.
Software Dependencies | No | We used Qwen2.5-7B-Instruct [22] as the base model and adapted the Verl framework [85] for reinforcement learning training, modifying its implementation to incorporate our novel surrogate function design with the Partial KL strategy and two-stage reward mechanism. The example code in Appendix A.2 uses 'gurobipy', but no specific version for it or Python is mentioned.
Experiment Setup | Yes | The key hyperparameters for SIRL training are detailed in Table 6: (Algorithm: reinforce_plus_plus, Data batch size: 128, Learning rate: 1e-6, Max prompt length: 2048, Max response length: 8192, Truncation: left, Actor/Rollout KL loss type: low_var_kl, KL loss coefficient: 0.005, Rollout number: 8, PPO mini batch size: 8, PPO micro batch size per GPU: 4, Clip ratio low: 0.20, Clip ratio high: 0.28). The exact sampling hyperparameters used to generate our main results are specified in Table 7: (n: 1, Temperature: 0.5, Top p: 0.9, Max tokens: 8192, Repetition penalty: 1.02).
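
The difficulty filter quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the sample record layout, and the `attempt_fn` callback (which stands in for one baseline-model attempt followed by a solver check against the ground-truth optimal value) are all hypothetical, and the sketch assumes "an 80% success rate" means at least 8 successes out of 10 attempts.

```python
import random
from typing import Callable, Dict, List

Sample = Dict[str, object]  # hypothetical record, e.g. {"question": ..., "answer": ...}
# Hypothetical callback: returns True if attempt number `i` on `sample`
# produced executable code whose optimal value matches the ground truth.
AttemptFn = Callable[[Sample, int], bool]


def is_too_easy(sample: Sample, attempt_fn: AttemptFn,
                n_attempts: int = 10, min_successes: int = 8) -> bool:
    """Return True if the baseline succeeds on at least `min_successes` attempts."""
    successes = sum(1 for i in range(n_attempts) if attempt_fn(sample, i))
    return successes >= min_successes


def build_training_set(samples: List[Sample], attempt_fn: AttemptFn,
                       k: int, seed: int = 0) -> List[Sample]:
    """Drop samples the baseline finds trivial, then draw up to `k` at random."""
    kept = [s for s in samples if not is_too_easy(s, attempt_fn)]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(kept, min(k, len(kept)))
```

With the figures quoted above, `samples` would be the synthetic pool, the filtered list would hold roughly 70,000 items, and `k` would be 10,000. The seed is an assumption added here for repeatability; the quoted text does not mention one.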
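
For readers who want the Experiment Setup values in machine-readable form, the sketch below collects them into two Python dataclasses. The values are copied from the row above; the field names are illustrative and are not claimed to match the configuration keys of the Verl framework or of any inference library. The two helper functions rest on stated assumptions: that every prompt in a data batch receives `rollout_n` sampled responses, and that the prompt and response limits add up to the longest sequence the model processes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values as reported for SIRL training (Table 6 in the quoted text).
    algorithm: str = "reinforce_plus_plus"
    data_batch_size: int = 128
    learning_rate: float = 1e-6
    max_prompt_length: int = 2048
    max_response_length: int = 8192
    truncation: str = "left"
    kl_loss_type: str = "low_var_kl"
    kl_loss_coef: float = 0.005
    rollout_n: int = 8
    ppo_mini_batch_size: int = 8
    ppo_micro_batch_size_per_gpu: int = 4
    clip_ratio_low: float = 0.20
    clip_ratio_high: float = 0.28


@dataclass(frozen=True)
class SamplingConfig:
    # Values as reported for generating the main results (Table 7).
    n: int = 1
    temperature: float = 0.5
    top_p: float = 0.9
    max_tokens: int = 8192
    repetition_penalty: float = 1.02


def responses_per_batch(cfg: TrainConfig) -> int:
    """Assumption: each prompt in a data batch gets `rollout_n` responses."""
    return cfg.data_batch_size * cfg.rollout_n


def max_sequence_length(cfg: TrainConfig) -> int:
    """Assumption: longest sequence is a full prompt plus a full response."""
    return cfg.max_prompt_length + cfg.max_response_length
```

Under those assumptions, one data batch yields 128 × 8 = 1024 sampled responses, and the longest sequence is 2048 + 8192 = 10240 tokens.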