Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Let the LLM Stick to Its Strengths: Learning to Route Economical LLM

Authors: Yi-Kai Zhang, Shiyin Lu, Qingguo Chen, Weihua Luo, De-Chuan Zhan, Han-Jia Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across six datasets, LLMRec achieves an average cost reduction of over 38% while maintaining accuracy and consistently outperforming baselines in converging toward the Pareto frontier.
Researcher Affiliation	Collaboration	Yi-Kai Zhang (1,2), Shiyin Lu (3), Qing-Guo Chen (3), Weihua Luo (3), De-Chuan Zhan (1,2), Han-Jia Ye (1,2). (1) School of Artificial Intelligence, Nanjing University; (2) National Key Laboratory for Novel Software Technology, Nanjing University; (3) AI Business, Alibaba Group.
Pseudocode	No	The paper describes the methodology of LLMRec but does not present it in a structured pseudocode or algorithm block format.
Open Source Code	Yes	The code for the methods is provided in the supplemental material.
Open Datasets	Yes	We consider general evaluation datasets, commonsense reasoning, math reasoning, code generation, symbolic reasoning, and specific domain datasets such as medical, law, and financial datasets, totaling 35 datasets... benchmarks such as MMLU [24], MMLU-Pro [51], BBH [47], ARC-Challenge [6], TruthfulQA [32], Winogrande [43], and HellaSwag [60]. For reasoning capabilities, we consider domains like mathematics (MATH [25], MMLU-STEM, GSM8K [12]) and code generation (HumanEval [10], HumanEval+ [34], MBPP [4], MBPP+ [34]).
Dataset Splits	No	We sample approximately 1100k of these for training, with stronger-performing pairs being assigned higher sampling weights. Out of the 35 datasets, 24 are multiple-choice datasets, and 11 are fill-in-the-blank or question-answer datasets, including token generation via the generate method.
Hardware Specification	No	In the appendix, we describe the computational resources used to complete model training and experiments. (However, the appendix content is not provided in the given text.)
Software Dependencies	No	The paper does not explicitly state specific software dependencies with version numbers.
Experiment Setup	No	The paper describes the training data construction, evaluation metrics, and general approach, but it does not specify concrete experimental setup details such as hyperparameters (e.g., learning rate, batch size, epochs) for the LLMRec model training.