Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Prompt-to-Leaderboard: Prompt-Adaptive LLM Evaluations

Authors: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This section contains a suite of experiments that validate the P2L method and demonstrate its utility. In Section 3.2, we show that P2L yields gains in human preference prediction that scale with model size and data, reporting direct predictive performance on pairwise human preferences along with scaling behavior in data size and parameter count. In Section 3.3, we show that P2L enables optimal cost-efficient routing via the algorithm developed previously in Section 2.1.2. In Section 3.4, we use P2L to automatically identify strengths and weaknesses of different models. In Section 3.5, we evaluate our aggregation technique against ground-truth category leaderboards and observe data-scaling trends. Finally, in Section 3.6, we show that P2L achieves reasonable performance on out-of-distribution data.
Researcher Affiliation Academia Evan Frick 1 Connor Chen 1 Joseph Tennyson 1 Tianle Li 1 Wei-Lin Chiang 1 Anastasios N. Angelopoulos 1 Ion Stoica 1 1University of California, Berkeley. Correspondence to: Evan Frick <EMAIL>.
Pseudocode Yes Algorithm 1 Optimal routing with BT estimate. Input: q; W; θ^(z); c; C. 1: Solve the LP: π* = argmax_{π ∈ Δ_M, π^⊤c ≤ C} π^⊤ W q. 2: Compute R = π*^⊤ W q. 3: Solve for θ* by finding the root of the implicit equation ∑_a q_a σ(θ* − θ^(z)_a) = R. Output: optimal router π*, estimate of the router's BT coefficient θ*.
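The three steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `route` and the shape conventions for `q` (a distribution over opponent models), `W` (an M×A win-probability matrix for the M routable models), and `theta_z` (opponent BT coefficients) are chosen here for exposition.

```python
import numpy as np
from scipy.optimize import linprog, brentq

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def route(q, W, theta_z, cost, budget):
    """Sketch of Algorithm 1 (optimal routing with a BT estimate).

    q       : (A,) distribution over opponent models
    W       : (M, A) win-probability matrix for the M routable models
    theta_z : (A,) BT coefficients of the opponents
    cost    : (M,) per-query cost of each routable model
    budget  : scalar cost ceiling C
    """
    M = W.shape[0]
    v = W @ q  # expected win rate of each routable model against q
    # Step 1: LP over the probability simplex with a cost constraint.
    # linprog minimizes, so the objective pi^T W q is negated.
    res = linprog(-v,
                  A_ub=cost.reshape(1, -1), b_ub=[budget],
                  A_eq=np.ones((1, M)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * M)
    pi = res.x
    # Step 2: expected win rate R of the resulting router.
    R = float(pi @ v)
    # Step 3: invert sum_a q_a * sigma(theta - theta_z_a) = R to get the
    # router's implied BT coefficient theta*.
    theta_star = brentq(lambda t: np.sum(q * sigmoid(t - theta_z)) - R,
                        -50.0, 50.0)
    return pi, theta_star
```

For example, if the stronger of two models exceeds the cost budget on its own, the LP returns a mixture that saturates the budget, and step 3 maps the mixture's win rate back onto the BT scale.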
Open Source Code Yes Our code is available at this GitHub link: https://github.com/lmarena/p2l.
Open Datasets Yes To train a P2L model, we follow this three-step procedure: (1) Begin with a pre-trained, instruction-tuned LLM. (2) Remove the existing language model head and replace it with a randomly initialized coefficient head. In the BT case, the coefficient head is a linear layer producing M outputs, one per model. (3) Train the model by running stochastic gradient descent on all parameters to minimize the negative log-likelihood: L(θ) = −∑_{i=1}^{n} log g_{θ(Z_i)}(Y_i; X_i). The result of this procedure is the trained model θ̂ = argmin_{θ∈Θ} L(θ), which is a direct generalization of (1). We train on up to n = 1.5 million crowdsourced human preference pairs from Chatbot Arena, containing M = 130 unique models. ... To assess how P2L generalizes to unseen prompts, we evaluate it on LiveBench (White et al., 2024), a verifiable, contamination-free benchmark with 1,000 questions covering diverse categories (e.g., math, coding, reasoning).
Dataset Splits Yes We construct a holdout validation set containing 41,507 annotated pairwise comparisons across 34 widely used models. We then measure the negative log-likelihood (validation loss) on this dataset; a lower validation loss indicates better preference prediction performance.
Hardware Specification Yes P2L-7B on 1.5 million data points costs less than $250 to train end-to-end using a relatively unoptimized DeepSpeed and Hugging Face Trainer infrastructure ($23.92 per hour for 8x H100 on Runpod).
Software Dependencies No P2L-7B on 1.5 million data points costs less than $250 to train end-to-end using a relatively unoptimized DeepSpeed and Hugging Face Trainer infrastructure ($23.92 per hour for 8x H100 on Runpod).
Experiment Setup Yes To train a P2L model, we follow this three-step procedure: (1) Begin with a pre-trained, instruction-tuned LLM. (2) Remove the existing language model head and replace it with a randomly initialized coefficient head. In the BT case, the coefficient head is a linear layer producing M outputs, one per model. (3) Train the model by running stochastic gradient descent on all parameters to minimize the negative log-likelihood: L(θ) = −∑_{i=1}^{n} log g_{θ(Z_i)}(Y_i; X_i). The result of this procedure is the trained model θ̂ = argmin_{θ∈Θ} L(θ), which is a direct generalization of (1). ... We always train for 1 epoch. In order to study the scaling laws of P2L as a function of model size, we used the following models as the initializations: SmolLM2-{135, 360}M-Instruct and Qwen2.5-{0.5, 1.5, 3, 7}B-Instruct (Allal et al., 2024; Team, 2024).
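The negative log-likelihood minimized in step (3) can be made concrete for the BT case. The sketch below computes the Bradley-Terry NLL over a batch of pairwise comparisons, given the per-prompt coefficient vector that the coefficient head would output; the function name `bt_nll` and the array layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bt_nll(theta, pairs, outcomes):
    """Negative log-likelihood of pairwise outcomes under a Bradley-Terry
    model, given per-prompt coefficients theta (the coefficient head's
    M outputs for one prompt).

    theta    : (M,) BT coefficient for each of the M models
    pairs    : (n, 2) integer indices (model_a, model_b) per comparison
    outcomes : (n,) 1.0 if model_a won, 0.0 if model_b won
    """
    logits = theta[pairs[:, 0]] - theta[pairs[:, 1]]  # theta_a - theta_b
    p_a = 1.0 / (1.0 + np.exp(-logits))               # P(a beats b)
    eps = 1e-12                                       # numerical safety
    return float(-np.sum(outcomes * np.log(p_a + eps)
                         + (1.0 - outcomes) * np.log(1.0 - p_a + eps)))
```

Summing this quantity over prompts gives L(θ); the validation loss reported in the Dataset Splits row is the same NLL evaluated on held-out comparisons.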