Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Accelerating Unbiased LLM Evaluation via Synthetic Feedback
Authors: Zhaoyi Zhou, Yuda Song, Andrea Zanette
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University. Correspondence to: Zhaoyi Zhou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Control Variates Evaluation |
| Open Source Code | Yes | Our code is available at https://github.com/Zanette-Labs/control_variates_evaluation. |
| Open Datasets | Yes | Chatbot Arena (Zheng et al., 2023) contains 33k human-annotated preferences. MT Bench (Zheng et al., 2023) contains about 3.36k human-annotated preferences. We utilize the validation split of the HelpSteer2 dataset as our benchmark. |
| Dataset Splits | Yes | The testing of Control Variates with finetuning (Line 3 of Algorithm 1) is done in a cross-validation manner. Suppose there are K LLMs generating responses in the evaluation dataset. Our finetuning procedure trains K reward models, each by leaving out the data for a specific LLM. We utilize the validation split of the HelpSteer2 dataset as our benchmark. |
| Hardware Specification | Yes | The experiments are run on H100 GPUs. Finetuning Skywork-8B requires 4 GPUs. |
| Software Dependencies | No | The paper mentions models like GRM-Gemma-2B-sftreg, ArmoRM-Llama3-8B, Skywork-Reward-Llama-3.1-8B-v0.2, and GPT-4, and discusses learning rates and batch sizes, but does not provide specific version numbers for software libraries or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | When finetuning Skywork-8B and GRM-2B on Chatbot Arena and MT Bench, we use global batch size 32 and train for 1 epoch. The finetuning of GRM-2B on Chatbot Arena uses learning rate 1e-6; all others use learning rate 3e-6. We tested learning rates in {1e-7, 3e-7, 1e-6, 3e-6, 1e-5, 3e-5} and batch sizes in {32, 64, 128}. |
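The paper's Algorithm 1 ("Control Variates Evaluation") combines a small set of human annotations with cheap synthetic scores to reduce the variance of the human-preference estimate. The sketch below is a minimal illustration of the classic control-variates estimator that this approach builds on, not the authors' exact algorithm; all function and variable names here are hypothetical.

```python
def control_variates_estimate(human, synth_labeled, synth_all):
    """Estimate the mean human score, using synthetic scores as a control variate.

    human:         human scores on the small annotated subset
    synth_labeled: synthetic-evaluator scores on that same subset
    synth_all:     synthetic-evaluator scores on the full (unannotated) dataset
    """
    n = len(human)
    mean_h = sum(human) / n
    mean_s = sum(synth_labeled) / n
    mean_s_all = sum(synth_all) / len(synth_all)
    # Optimal coefficient c = Cov(h, s) / Var(s), estimated on the labeled subset.
    cov = sum((h - mean_h) * (s - mean_s)
              for h, s in zip(human, synth_labeled)) / n
    var = sum((s - mean_s) ** 2 for s in synth_labeled) / n
    c = cov / var if var > 0 else 0.0
    # Correct the naive human-only mean by the synthetic-score discrepancy;
    # the stronger the human-synthetic correlation, the larger the variance reduction.
    return mean_h - c * (mean_s - mean_s_all)
```

The estimator stays unbiased for any fixed c, which is why the method can accelerate evaluation without introducing bias; a better-correlated (e.g., finetuned) synthetic evaluator simply shrinks the variance further, matching the paper's reported 12.2% vs. 24.8% annotation savings.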