Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Authors: Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason E Weston, Tianlu Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We conduct extensive experiments on five reward modeling benchmarks: RewardBench, PPE, RM-Bench, JudgeBench, and FollowBenchEval, spanning instructions across the categories of Chat, Safety, Code, Math, and fine-grained multi-level constraints. On RewardBench and PPE, EvalPlanner achieves new state-of-the-art scores (e.g., 93.9 on RewardBench) for generative reward models, outperforming baselines despite training on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models. [...] We conduct a set of comprehensive ablations that highlight the effectiveness of EvalPlanner's (1) unconstrained evaluation plans over constrained ones, (2) iterative optimization recipe for these plans, and (3) data efficiency, allowing it to obtain competitive performance with as few as 5K synthetic preference pairs.
Researcher Affiliation: Industry. FAIR at Meta. Correspondence to: Swarnadeep Saha <EMAIL>.
Pseudocode: No. The paper describes the methodology and training algorithm in narrative text and equations, and includes prompt templates in figures (Figures 3 and 4), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions using the fairseq2 library (Balioglu, 2023) and vLLM (Kwon et al., 2023) for model training and inference respectively, and provides a URL for fairseq2 (http://github.com/facebookresearch/fairseq2). However, this refers to third-party tools used, not the release of the authors' own implementation code for EvalPlanner. There is no explicit statement or link indicating that the source code for the methodology described in this paper is openly available.
Open Datasets: Yes. We select prompts from two different sources: WildChat (Zhao et al., 2024) and MATH (Hendrycks et al., 2021). [...] We test EvalPlanner on the following pairwise evaluation benchmarks: RewardBench (Lambert et al., 2024). [...] Preference Proxy Evaluations (PPE) (Frick et al., 2025). [...] FollowBenchEval. We build this new evaluation benchmark from FollowBench (Jiang et al., 2024). [...] RM-Bench (Liu et al., 2024). [...] JudgeBench (Tan et al., 2024).
Dataset Splits: Yes. From this, we select a random subset of 5K instructions (consisting of 2.5K from WildChat and 2.5K from MATH) for SFT and the first iteration of DPO. We reserve the rest for the second iteration of DPO. In each iteration, we sample 5 plans and, for each plan, we sample 8 executions (4 in each order of the response pair) using a temperature of 0.8 and top-p of 0.95. [...] As the validation set, we choose 150 samples from each of WildChat and MATH, which we use for checkpoint selection.
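The split described in the quote above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' code: the function name `make_splits` and the use of `random.Random` for shuffling are assumptions; only the sizes (2.5K per source for SFT/DPO iteration 1, 150 per source for validation, remainder for DPO iteration 2) come from the paper.

```python
import random

def make_splits(wildchat_prompts, math_prompts, seed=0):
    """Sketch of the reported split: 2.5K WildChat + 2.5K MATH prompts
    for SFT and DPO iteration 1, 150 per source for validation
    (checkpoint selection), and the remainder for DPO iteration 2."""
    rng = random.Random(seed)
    wc = rng.sample(wildchat_prompts, len(wildchat_prompts))  # shuffled copy
    ma = rng.sample(math_prompts, len(math_prompts))
    train_iter1 = wc[:2500] + ma[:2500]      # 5K instructions total
    val = wc[2500:2650] + ma[2500:2650]      # 150 + 150 validation samples
    train_iter2 = wc[2650:] + ma[2650:]      # reserved for DPO iteration 2
    return train_iter1, val, train_iter2
```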
Hardware Specification: No. The paper states "We develop EvalPlanner with either Llama-3.1-70B-Instruct or Llama-3.3-70B-Instruct as the seed model" but does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud computing instances) used for training or inference.
Software Dependencies: No. We use the fairseq2 library (Balioglu, 2023) for model training and vLLM (Kwon et al., 2023) for inference. However, specific version numbers for fairseq2 and vLLM are not provided in the text.
Experiment Setup: Yes. In each iteration, we sample 5 plans and, for each plan, we sample 8 executions (4 in each order of the response pair) using a temperature of 0.8 and top-p of 0.95. We develop EvalPlanner with either Llama-3.1-70B-Instruct or Llama-3.3-70B-Instruct as the seed model to show the generalizability of our approach across multiple seed models. As the validation set, we choose 150 samples from each of WildChat and MATH, which we use for checkpoint selection. To account for position bias in pairwise evaluation, we double the number of examples in the validation set by considering both orders of response pairs. We use the fairseq2 library (Balioglu, 2023) for model training and vLLM (Kwon et al., 2023) for inference. All models are trained for a maximum of 1K steps, saving checkpoints every 100 steps and doing early stopping based on the validation set. Detailed training hyperparameters are provided in Table 12. [...]

Table 12. Training hyper-parameters used for SFT and DPO of EvalPlanner.

Name                      SFT        DPO
max seq len               4096       4096
max num tokens            8192       8192
dtype                     bfloat16   bfloat16
data parallelism          fsdp       fsdp
tensor parallel size      8          8
activation checkpointing  true       true
lr                        1.0e-06    5.5e-08
betas                     0.9, 0.95  0.9, 0.95
weight decay              0.1        0.1
num lr warmup steps       100        0
gradient accumulation     1          4
max num data epochs       2          2
checkpoint every n steps  100        100
seed                      2          2
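The position-bias control quoted above (doubling the validation set by considering both orders of each response pair) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the tuple layout `(prompt, response_a, response_b, preferred)` and the function name are hypothetical; the paper specifies only that each pair is evaluated in both orders.

```python
def double_for_position_bias(val_pairs):
    """For each pairwise example, also emit the order-swapped copy with
    the preference label flipped, so a judge biased toward the first or
    second position cannot score above chance on the doubled set."""
    doubled = []
    for prompt, resp_a, resp_b, preferred in val_pairs:
        doubled.append((prompt, resp_a, resp_b, preferred))
        flipped = "B" if preferred == "A" else "A"
        doubled.append((prompt, resp_b, resp_a, flipped))
    return doubled
```

A judge is then counted correct on a pair only if its verdict tracks the flipped label in the swapped copy, which is what makes the doubled accuracy robust to position bias.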