Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Causal LLM Routing: End-to-End Regret Minimization from Observational Data

Authors: Asterios Tsiourvas, Wei Sun, Georgia Perakis

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models. We conduct comprehensive experiments on two public benchmarks, demonstrating that our regret-minimizing and heterogeneous cost-aware approaches consistently outperform existing baselines. Our methods achieve state-of-the-art performance across both BERT-based and LLaMA-based embeddings, highlighting their robustness and practical effectiveness.
Researcher Affiliation	Collaboration	Asterios Tsiourvas (MIT); Wei Sun (IBM Research); Georgia Perakis (MIT)
Pseudocode	No	The paper describes methodologies and proofs but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	You can find all the details to reproduce the experiments in Section 4 and in Section D of the Appendix. We will also release code in the supplementary material. Furthermore, the code used is provided in the supplemental materials.
Open Datasets	Yes	We evaluate our methods on two publicly available benchmarks for LLM routing: RouterBench [Hu et al., 2024] and SPROUT [Somerstep et al., 2025].
Dataset Splits	Yes	SPROUT includes a predefined split: 80% for training, with the remaining 20% evenly divided between validation and test sets. To maintain consistency in evaluation, we adopt the same split strategy for RouterBench, applied deterministically at the prompt level to ensure reproducibility.
Hardware Specification	Yes	Experiments were conducted on an internal compute cluster equipped with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 512 GB of RAM, and two NVIDIA V100 GPUs with 16 GB memory each.
Software Dependencies	Yes	All experiments were implemented in Python 3.8.12 Van Rossum and Drake [2009], using PyTorch 2.4.1+cu121 Paszke et al. [2019] and Scikit-learn Buitinck et al. [2013].
Experiment Setup	Yes	The model is trained with the Adam optimizer (learning rate 1e-4) for up to 10,000 epochs, using early stopping with a patience of 100. A softmax output with temperature τ = 100 is used to control the sharpness of the output probabilities. All neural models used in our experiments share the same architecture for fairness and comparability. We use a 2-layer feedforward neural network with GELU activation and 200 hidden units per layer. Models are trained using the Adam optimizer with a learning rate of 10^-4, batch size of 128, and a maximum of 10,000 epochs. Early stopping is applied with a patience of 100 epochs based on validation regret. The temperature parameter for the softmax-based regret objective is set to 100, and to 1000 for the interval model to allow smoother gradients across budget intervals.
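
Illustrative sketch (not taken from the paper): the Dataset Splits and Experiment Setup rows mention a deterministic prompt-level 80/10/10 split and early stopping with a patience of 100 epochs on validation regret. The quoted text does not say how either was implemented, so the code below is only one plausible reading. Hashing the prompt text to pick a split is an assumption, and the names `assign_split` and `EarlyStopper` are hypothetical.

```python
import hashlib


def assign_split(prompt: str, train_frac: float = 0.8, val_frac: float = 0.1) -> str:
    """Map a prompt to 'train', 'val', or 'test' from a hash of its text.

    Assumption: hashing is one way to make a prompt-level split deterministic;
    the paper may use a different mechanism (for example a fixed seed).
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to the unit interval.
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train_frac:
        return "train"
    if u < train_frac + val_frac:
        return "val"
    return "test"


class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without a new best
    (lowest) validation regret."""

    def __init__(self, patience: int = 100):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_regret: float) -> bool:
        """Record one epoch's validation regret; return True when training should stop."""
        if val_regret < self.best:
            self.best = val_regret
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Because the split depends only on the prompt text, the same prompt lands in the same partition on every run and on every machine, which is the property the Dataset Splits row describes. In a training loop, `EarlyStopper.step` would be called once per epoch, up to the 10,000-epoch cap quoted above.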