Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems

Authors: Yi Zhang, Yushen Long, Yun Ni, Liping Huang, Xiaohong Wang, Jun Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.
Researcher Affiliation	Collaboration	1 Agency for Science, Technology and Research, Singapore; 2 Morgan Stanley Asia Pte.; 3 Onto Innovation Inc.; 4 School of Computing and Communications, Lancaster University, UK
Pseudocode	Yes	Algorithm 1 Generate New Individual Algorithm 2 LLM-Optimizer Interaction Protocol
Open Source Code	Yes	The source code can be found in: https://github.com/yizhangele/llm-guided-mod-optimization.
Open Datasets	Yes	The 9 testing scenarios in Table 2 are constructed using the New York taxi dataset [39]. We also test on Chicago taxi dataset [40]... [39] New York City Taxi and Limousine Commission. TLC trip record data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. [40] Chicago Data Portal. Taxi trips. https://data.cityofchicago.org/Transportation/Taxi-Trips-2013-2023-/wrvz-psew/about_data.
Dataset Splits	No	While our work does not involve training machine learning models in the traditional sense, it integrates a well-established pretrained LLM with a mathematical optimization framework. As such, there are no training/test data splits or model training procedures to report.
Hardware Specification	Yes	All optimizer-based methods, either manual objectives or our adaptive-objective method, optimization solver Gurobi [42] is adopted to solve the problem running on a PC with 13th Gen Intel Core i9-13900KF 32 CPU up to 5.80 GHz and RAM 32GB.
Software Dependencies	Yes	In our experimental setup, we utilize the DeepSeek-R1-Distill-Qwen-32B [41] model through the Hugging Face platform API as the default large language model for all LLM-based methods... optimization solver Gurobi [42] is adopted to solve the problem running on a PC with 13th Gen Intel Core i9-13900KF 32 CPU up to 5.80 GHz and RAM 32GB.
Experiment Setup	Yes	In our experimental setup, we utilize the DeepSeek-R1-Distill-Qwen-32B [41] model through the Hugging Face platform API as the default large language model for all LLM-based methods, which allow us to evaluate the adaptability of our method on smaller LLMs, thereby highlighting its potential applications. The temperature parameter is configured to 0.9. LLM-based methods all executed 3 times for each scenario, and the mean value of these three runs is reported in Tables 2 and 3. FunSearch is performed under 20 iterations. EoH and our method all employ 10 iterations with a population size of 5.
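
The settings quoted in the Experiment Setup row can be collected into a small configuration object, as a reading aid for anyone attempting to reproduce the runs. The sketch below is illustrative only: the class name, field names, and the `mean_over_runs` helper are hypothetical and are not taken from the authors' repository. Only the numeric values (temperature 0.9, 3 runs per scenario, 10 iterations with a population size of 5, 20 FunSearch iterations) and the model name come from the quoted text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass(frozen=True)
class ExperimentConfig:
    # Values are quoted from the Experiment Setup row; field names are hypothetical.
    llm_model: str = "DeepSeek-R1-Distill-Qwen-32B"
    temperature: float = 0.9
    runs_per_scenario: int = 3      # each LLM-based method is run 3 times per scenario
    iterations: int = 10            # EoH and the proposed method
    population_size: int = 5        # EoH and the proposed method
    funsearch_iterations: int = 20  # FunSearch baseline


def mean_over_runs(run_once: Callable[[int], float], config: ExperimentConfig) -> float:
    """Call run_once(run_index) once per repeat and return the mean of the results."""
    results: List[float] = [run_once(i) for i in range(config.runs_per_scenario)]
    return mean(results)


if __name__ == "__main__":
    config = ExperimentConfig()
    # Placeholder objective values, made up purely for illustration.
    fake_costs = [10.0, 12.0, 14.0]
    print(mean_over_runs(lambda i: fake_costs[i], config))
```

In a real reproduction, `run_once` would wrap one full LLM-plus-solver run for a scenario; how that run is launched depends on the released code and is not specified here.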