Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Authors: Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on diverse public benchmarks demonstrate that models trained with our SIRL framework achieve state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Specifically, our SIRL-32B model surpasses DeepSeek-V3 and OpenAI-o3 on the majority of these benchmarks. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.
Researcher Affiliation	Collaboration	Yitian Chen (1), Jingfan Xia (2, 1), Siyu Shao (1, 3), Dongdong Ge (4), Yinyu Ye (4, 5). Affiliations: (1) Cardinal Operations, China; (2) Shanghai University of Finance and Economics; (3) The University of Hong Kong; (4) Antai School of Economics and Management, Shanghai Jiao Tong University; (5) Department of Management Science and Engineering, Stanford University.
Pseudocode	No	The paper describes the SIRL framework, its components, and training procedures in narrative text. While it references algorithms like REINFORCE++ and discusses structured prompts, it does not contain a dedicated pseudocode block, algorithm box, or a clearly labeled 'Algorithm X' with step-by-step instructions in a code-like format for its own methodology.
Open Source Code	Yes	Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.
Open Datasets	Yes	We evaluated the performance of our trained models on four key optimization modeling datasets: NL4OPT [82], MAMO [83], IndustryOR [25], and OptMATH [26]. ... For the benefit of future research, a comprehensive description detailing our correction methodology, the rationale for each modification, and the revised datasets are publicly available at the GitHub repository https://github.com/Cardinal-Operations/SIRL/blob/main/test_data/README.md.
Dataset Splits	Yes	Starting from the synthetic dataset, we applied a filtering strategy guided by the principle Less is More [73, 74]. Specifically, we excluded (question, answer) pairs if the baseline Qwen-32B-Instruct model [22] achieved an 80% success rate (8/10 attempts across different prompting roles) in generating executable code matching the ground-truth optimal value, as such samples were deemed too trivial. This process yielded approximately 70,000 samples. From this set, we then randomly sampled 10,000 instances to form our training data. We evaluated the performance of our trained models on four key optimization modeling datasets: NL4OPT [82], MAMO [83], IndustryOR [25], and OptMATH [26] using pass@1 accuracy.
Hardware Specification	Yes	All experiments for the 7B model were conducted on a single compute node equipped with eight 80GB NVIDIA H100 GPUs.
Software Dependencies	No	We used Qwen2.5-7B-Instruct [22] as the base model and adapted the Verl framework [85] for reinforcement learning training, modifying its implementation to incorporate our novel surrogate function design with the Partial KL strategy and two-stage reward mechanism. The example code in Appendix A.2 uses 'gurobipy', but no specific version for it or Python is mentioned.
Experiment Setup	Yes	The key hyperparameters for SIRL training are detailed in Table 6: (Algorithm: reinforce_plus_plus, Data Batch size: 128, Learning rate: 1e-6, Max prompt length: 2048, Max response length: 8192, Truncation: left, Actor/Rollout KL loss type: low_var_kl, KL loss coefficient: 0.005, Rollout number: 8, PPO mini batch size: 8, PPO micro batch size per GPU: 4, Clip ratio low: 0.20, Clip ratio high: 0.28). The exact sampling hyperparameters used to generate our main results are specified in Table 7: (n: 1, Temperature: 0.5, Top p: 0.9, Max tokens: 8192, Repetition penalty: 1.02).
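
The filtering step quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the function name, the data layout, and the seed are hypothetical, and the sketch assumes that "an 80% success rate" means at least 8 successful baseline attempts out of 10. It drops pairs the baseline already solves reliably, then draws a random training subset from what remains.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical layout: (question text, ground-truth optimal value).
Pair = Tuple[str, float]


def filter_and_sample(
    pairs: Sequence[Pair],
    baseline_successes: Sequence[int],
    n_train: int,
    min_successes: int = 8,  # assumed cutoff: 8 of 10 attempts (80%)
    seed: int = 0,           # assumption: the quoted text reports no seed
) -> List[Pair]:
    """Drop pairs the baseline solves too often, then sample n_train of the rest."""
    if len(pairs) != len(baseline_successes):
        raise ValueError("need one success count per (question, answer) pair")

    # Keep only pairs whose baseline success count is below the cutoff.
    kept = [p for p, s in zip(pairs, baseline_successes) if s < min_successes]

    # Sample without replacement; cap at the number of surviving pairs.
    rng = random.Random(seed)
    return rng.sample(kept, min(n_train, len(kept)))
```

With the figures quoted above, `n_train` would be 10,000 and the surviving pool would hold roughly 70,000 pairs; the success counts themselves would come from running the baseline model ten times per question, which is outside the scope of this sketch.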