Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

Authors: Fangzhou Wu, Sandeep Silwal

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our algorithm achieves a competitive ratio of 1 − o(1) under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55× in overall performance, 1.85× in cost efficiency, and nearly 4.25× in throughput. Our code is available at https://github.com/fzwark/PORT.
Researcher Affiliation	Academia	Fangzhou Wu University of Wisconsin Madison EMAIL Sandeep Silwal University of Wisconsin Madison EMAIL
Pseudocode	Yes	Algorithm 1: Routing with Learned γ. 1: P ← ϵQ (first ϵ-fraction of queries) 2: for j ∈ P do 3: Randomly pick w_j ∈ {0} ∪ [M] 4: Estimate d̂_ij and ĝ_ij, ∀i ∈ [M] 5: if w_j > 0 then 6: Route j to the w_j-th LLM 7: Compute γ* ← arg min_γ F(γ, P) 8: Y ← Q \ P 9: for j ∈ Y do 10: Estimate d̂_ij and ĝ_ij, ∀i ∈ [M] 11: Compute α d̂_ij − ĝ_ij γ*_i, ∀i ∈ [M] 12: Route j to i* = arg max_i (α d̂_ij − ĝ_ij γ*_i)
Open Source Code	Yes	Our code is available at https://github.com/fzwark/PORT.
Open Datasets	Yes	Benchmarks. We use 3 different benchmarks in our experiments: RouterBench [23], SPROUT [50], and Open LLM Leaderboard v2 [16]. RouterBench contains 11 different LLMs, and we randomly sample 10000 queries as the test query dataset and use the remaining data as the historical dataset. Open LLM Leaderboard v2 contains 18 different LLMs, where we similarly sample 10000 queries for the test dataset and use the rest as the historical dataset. SPROUT contains 13 LLMs, and we use the training set as the historical dataset, while the validation and test sets are combined to form the test queries.
Dataset Splits	Yes	For RouterBench, [...] we randomly sample 10000 queries as the test query dataset and use the remaining data as the historical dataset. [...] For SPROUT, [...] We use the provided training set as the historical dataset, and combine the validation and test sets to construct the test query set. [...] For Open LLM Leaderboard v2, [...] we randomly sample 10000 queries as the test set and use the remainder as the historical data.
Hardware Specification	Yes	Devices. All experiments are conducted on a machine equipped with 16 CPUs and 32GB of memory. For training the Roberta models used in the model-based baselines, we use two NVIDIA H200 GPUs.
Software Dependencies	No	We implement all optimization codes using the CVXPY [12] package. For the one-time optimization step in our algorithm, we use the L-BFGS-B solver [66]. For Batch Split, which involves solving a linear program for each batch, we adopt the HIGHS solver [24]. In the main setting, the queries are embedded using bge-base-en-v1.5 [59]. For the diversity consideration, we additionally evaluate with two different embedding models: SFR-Embedding-2_R [40] and gte-Qwen2-1.5B-instruct [35].
Experiment Setup	Yes	Table 1 presents the main results under the 3 benchmarks, using α = 0.0001 and ϵ = 0.025 for our algorithm, with the maximum available test queries, and historical data in the evaluation. [...] We adopt HNSW [38] as the main ANNS algorithm and set the number of candidate neighbors (|R_j|) to 5 for both ANNS and KNN. For Batch Split, we use a mini-batch size of 256 to balance LP computation cost with the low-latency requirements of online routing.
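
Illustrative sketch (not taken from the paper or its repository): the control flow of Algorithm 1, as quoted in the Pseudocode row, can be written out in a few lines of Python. The function names, the uniform choice of w_j, the `budgets` argument, and above all the form of F(γ, P) are assumptions made for illustration; the excerpt above does not define F. Here F is assumed to be the dual of a budgeted assignment problem, F(γ, P) = Σ_i ϵ·B_i·γ_i + Σ_{j∈P} max(0, max_i(α·d̂_ij − ĝ_ij·γ_i)). The paper reports using L-BFGS-B for the one-time optimization step; this sketch substitutes plain projected subgradient steps so that it needs only the standard library.

```python
import random
from typing import Callable, List, Optional, Sequence

Matrix = Sequence[Sequence[float]]  # rows = queries j, columns = LLMs i


def route_with_learned_gamma(
    d_hat: Matrix,
    g_hat: Matrix,
    eps: float,
    alpha: float,
    fit_gamma: Callable[[Matrix, Matrix], List[float]],
    rng: random.Random,
) -> List[Optional[int]]:
    """Return one 0-based LLM index per query (None = query left unrouted)."""
    n, m = len(d_hat), len(d_hat[0])
    n_explore = int(eps * n)  # P: the first eps-fraction of queries
    routes: List[Optional[int]] = []

    # Steps 2-6: random routing on P; w = 0 means "do not route".
    # Assumption: w is drawn uniformly from {0, 1, ..., M}.
    for _ in range(n_explore):
        w = rng.randint(0, m)
        routes.append(w - 1 if w > 0 else None)

    # Step 7: learn gamma once from the estimates observed on P.
    gamma = fit_gamma(d_hat[:n_explore], g_hat[:n_explore])

    # Steps 8-12: route each remaining query to the best-scoring LLM.
    for j in range(n_explore, n):
        scores = [alpha * d_hat[j][i] - g_hat[j][i] * gamma[i] for i in range(m)]
        routes.append(max(range(m), key=scores.__getitem__))
    return routes


def make_subgradient_fit(
    budgets: Sequence[float],
    alpha: float,
    eps: float,
    steps: int = 200,
    lr: float = 0.01,
) -> Callable[[Matrix, Matrix], List[float]]:
    """Hypothetical stand-in for arg min_gamma F(gamma, P) under the assumed F."""

    def fit(d_p: Matrix, g_p: Matrix) -> List[float]:
        m = len(budgets)
        gamma = [0.0] * m
        for _ in range(steps):
            # Subgradient of the assumed F: eps * B_i minus the cost of the
            # queries in P that currently prefer LLM i with a positive score.
            grad = [eps * b for b in budgets]
            for d_row, g_row in zip(d_p, g_p):
                scores = [alpha * d_row[i] - g_row[i] * gamma[i] for i in range(m)]
                best = max(range(m), key=scores.__getitem__)
                if scores[best] > 0:
                    grad[best] -= g_row[best]
            # Projected step keeps every gamma_i non-negative.
            gamma = [max(0.0, x - lr * dx) for x, dx in zip(gamma, grad)]
        return gamma

    return fit


if __name__ == "__main__":
    rng = random.Random(0)
    d_hat = [[rng.random() for _ in range(3)] for _ in range(40)]
    g_hat = [[rng.random() for _ in range(3)] for _ in range(40)]
    fit = make_subgradient_fit(budgets=[5.0, 5.0, 5.0], alpha=1.0, eps=0.25)
    print(route_with_learned_gamma(d_hat, g_hat, 0.25, 1.0, fit, rng))
```

Passing `fit_gamma` as a callable keeps the routing loop faithful to the quoted steps while leaving the optimizer swappable, for example for an L-BFGS-B based fit as the Software Dependencies row describes.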