Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Cost-Aware Contrastive Routing for LLMs

Authors: Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive Evaluation: We evaluate our method on three routing benchmarks spanning both open-source checkpoints and proprietary APIs. It achieves up to 25% higher accuracy-cost efficiency on a fixed pool of LLMs and demonstrates strong robustness to unseen models and out-of-distribution prompts at inference time.
Researcher Affiliation	Academia	Reza Shirkavand, Department of Computer Science, University of Maryland, College Park; Shangqian Gao, Department of Computer Science, Florida State University; Peiran Yu, Department of Computer Science and Engineering, University of Texas at Arlington; Heng Huang, Department of Computer Science, University of Maryland, College Park
Pseudocode	No	The paper describes methods using mathematical equations and structured prose, but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	No	All datasets used are open source. We will provide the code for our experiments after the paper decision is available.
Open Datasets	Yes	Datasets & Benchmarks: We train our router and evaluate it on three datasets: EmbedLLM [105], MixInstruct [39], and RouterBench [34]. For EmbedLLM and MixInstruct, we sample 192 probes from their respective validation sets. Each probe is processed to extract logit-based descriptors by capturing the top K = 256 tokens over a horizon of T = 10 tokens (Equation (2)), resulting in a 256-dimensional vector per model. For RouterBench, we sample 192 probes from its training set, ensuring these probes are excluded from the training data used for the contrastive router. We compute perplexity-based descriptors on RouterBench and use GPT-2 [66].
Dataset Splits	Yes	For EmbedLLM and MixInstruct, we sample 192 probes from their respective validation sets. Each probe is processed to extract logit-based descriptors by capturing the top K = 256 tokens over a horizon of T = 10 tokens (Equation (2)), resulting in a 256-dimensional vector per model. For RouterBench, we sample 192 probes from its training set, ensuring these probes are excluded from the training data used for the contrastive router. ... In the out-of-distribution (OOD) experiments, we divided the prompts in the EmbedLLM dataset into two challenging sets based on their categories: STEM-related (Science, Technology, Engineering, and Mathematics) and Non-STEM-related (covering Social sciences, Humanities, Arts, etc.). ... Our splitting yielded 18,193 out of 36,054 total training questions and 1,060 distinct test prompts.
Hardware Specification	Yes	All training and descriptor extraction are done on RTX6000Ada GPUs with 48GB GPU memory.
Software Dependencies	No	We use a frozen sentence-transformers/all-MiniLM-L6-v2 [70] model as the embedding backbone (Φ(x) in Section 3) across all experiments. ... We compute perplexity-based descriptors on RouterBench and use GPT-2 [66]. ... Training is performed for 10 epochs using the AdamW optimizer...
Experiment Setup	Yes	Training: We use a frozen sentence-transformers/all-MiniLM-L6-v2 [70] model as the embedding backbone (Φ(x) in Section 3) across all experiments. Our trainable router component is a two-layer MLP, denoted as gθ(·), which projects prompt embeddings into the expert descriptor space. We train our contrastive router on the training splits of each dataset, excluding the probe examples from RouterBench. Training is performed for 10 epochs using the AdamW optimizer with a batch size of 512 and a learning rate of 5 × 10⁻⁴. For the cost spectrum loss (Equation (8)), we set the number of cost bands to K = 5 and the negative cost penalty to λ = 0.1. The hyperparameters for the linear schedule of band-specific temperatures (Equation (7)) are set as α = 0.25 and τmin = 0.05.
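
The probe hold-out described in the Open Datasets and Dataset Splits rows (192 probes sampled from a training set and excluded from the router's training data) can be illustrated with a minimal sketch. This is not the authors' code, which had not been released at the time of classification. The function name `hold_out_probes`, the seed, and the placeholder prompts are hypothetical; only the probe count of 192 comes from the paper excerpt above.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

NUM_PROBES = 192  # probe count reported in the paper excerpt


def hold_out_probes(
    examples: Sequence[T],
    num_probes: int = NUM_PROBES,
    seed: int = 0,  # hypothetical: the excerpt does not report a seed
) -> Tuple[List[T], List[T]]:
    """Split examples into (probes, remaining) so the two never overlap."""
    if num_probes > len(examples):
        raise ValueError("not enough examples to sample the requested probes")
    rng = random.Random(seed)
    # Sample indices without replacement, then partition by membership.
    probe_idx = set(rng.sample(range(len(examples)), num_probes))
    probes = [ex for i, ex in enumerate(examples) if i in probe_idx]
    remaining = [ex for i, ex in enumerate(examples) if i not in probe_idx]
    return probes, remaining


# Placeholder data standing in for a benchmark's training prompts.
train_pool = [f"prompt-{i}" for i in range(1000)]
probes, router_train = hold_out_probes(train_pool)
print(len(probes), len(router_train))  # prints "192 808"
```

Sampling indices rather than items keeps the split well defined even if the pool contains duplicate prompts, and fixing the seed makes the probe set repeatable across runs.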