Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Authors: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings.
Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Tianle Li <EMAIL>.
Pseudocode | No | The paper describes methods and processes (e.g., BenchBuilder Pipeline in Figure 2, LLM-Judge System Instruction in Appendix G) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like procedures.
Open Source Code | Yes | "Our code is available at https://github.com/lmarena/arena-hard-auto." and "We open-source both BenchBuilder pipeline and Arena-Hard-Auto benchmark." (Footnote 1: "Our code is available at: https://github.com/lmarena/arena-hard-auto")
Open Datasets | Yes | We apply BenchBuilder to crowd-sourced datasets, both Chatbot Arena (Chiang et al., 2024) and WildChat-1M (Zhao et al., 2024), demonstrating that it can robustly generate high-quality benchmarks that differentiate models.
Dataset Splits | No | The paper describes the curation of benchmarks like Arena-Hard-Auto (500 prompts) and Wild-Hard-Auto (250 prompts) as evaluation sets, but does not specify any further training/validation/test splits of these benchmarks or other datasets for the experiments conducted.
Hardware Specification | No | The paper mentions costs associated with using LLM APIs (GPT-4-Turbo, Llama-3-70B-Instruct) for annotation but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or training their pipeline.
Software Dependencies | Yes | To validate qualities assigned by GPT-4-Turbo, we construct ground truth labels for 200 sampled queries by collecting majority votes from GPT-4o (OpenAI, 2024b), Claude-3-Opus, and Gemini-1.5-Pro (Reid et al., 2024)... In Table 4... Model GPT4-T (gpt-4-1106-preview), Claude-3-Opus, Gemini1.5-Pro (gemini-1.5-pro-0514), Llama3-70B (llama-3-70b-instruct).
Experiment Setup | Yes | Then we use GPT-4-Turbo (OpenAI, 2023b) as a judge to assign a quality score to each prompt and remove any prompts. Prompts with a score less than 6 and topic clusters with a mean score less than 5 are discarded... To construct a 500-prompt benchmark, we sample 2 prompts each from 250 randomly selected clusters... We evaluate a model on a given prompt using a pairwise comparison against a strong baseline model (e.g., GPT-4-0314)... judge model (e.g., GPT-4-Turbo or Gemini-1.5-Pro) then scores each output by rating its preference between the pair on a 5-point Likert scale... To ensure consistency, we utilize chain-of-thought (Wei et al., 2023) prompting... We adopt the Bradley & Terry (1952) model to produce the final model scores... We use seed 42 for all experiments in this paper unless stated otherwise.
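
The filtering and sampling steps quoted under Experiment Setup can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The function name `curate` and the `(prompt_text, cluster_id, score)` tuple schema are hypothetical. The sketch also assumes that the cluster mean is computed before prompt-level filtering and that clusters left with fewer than two prompts are skipped; the quoted text does not specify either detail. Only the thresholds (6 and 5), the 2 x 250 sampling, and seed 42 come from the quoted text.

```python
import random
from collections import defaultdict
from statistics import mean


def curate(prompts, n_clusters=250, per_cluster=2,
           min_prompt_score=6, min_cluster_mean=5, seed=42):
    """Filter scored prompts and sample a benchmark from topic clusters.

    prompts: iterable of (prompt_text, cluster_id, score) tuples.
    This schema is hypothetical and chosen only for illustration.
    """
    # Group prompts by topic cluster.
    by_cluster = defaultdict(list)
    for text, cluster, score in prompts:
        by_cluster[cluster].append((text, score))

    eligible = {}
    for cluster, items in by_cluster.items():
        # Assumption: the cluster mean uses all prompts in the cluster,
        # i.e. it is computed before the prompt-level filter.
        if mean(score for _, score in items) < min_cluster_mean:
            continue
        # Discard individual prompts scoring below the prompt threshold.
        kept = [text for text, score in items if score >= min_prompt_score]
        # Assumption: clusters with too few surviving prompts are skipped.
        if len(kept) >= per_cluster:
            eligible[cluster] = kept

    # A dedicated, seeded generator keeps the selection reproducible.
    rng = random.Random(seed)
    chosen = rng.sample(sorted(eligible), min(n_clusters, len(eligible)))
    return [p for c in chosen for p in rng.sample(eligible[c], per_cluster)]
```

With at least 250 eligible clusters, the default arguments return 2 x 250 = 500 prompts, matching the size reported for Arena-Hard-Auto. Calling the function again with the same input and seed returns the same selection.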