Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs
Authors: Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run extensive experiments showing eva universally improves the performance of both online RL (e.g., RLOO, OAIF) and offline RL (e.g., DPO, SPPO, SimPO, ORPO) for LLMs and is SOTA on various challenging real-world benchmarks. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2University of Chicago. Correspondence to: Ziyu Ye and Yuan Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 eva. Input: initial policy π_{θ₀}, initial prompt set X₀. 1: for iteration t = 1, 2, … do: /* creator step */ 2: estimate: X_{t−1} ← {(x_i, info(x_i)) : x_i ∈ X_{t−1}}; sample: X_{t−1}^info ← {x_i drawn w.p. ∝ info(x_i)}; evolve: X_t ← evolve(X_{t−1}^info); /* solver step */ 3: generate: for x_i ∈ X_t, {y_i^(j)} ∼ π_{θ_{t−1}}(·∣x_i); annotate reward: X′_t ← X_t ∪ {(y_i^(j), r_i^(j))}; optimize: θ_t ← θ_{t−1} + η ∇_θ J_{X′_t}(θ); 4: end for; 5: return final solver policy π_{θ_T} |
| Open Source Code | No | The paper includes: 'Our codebase primarily relies on transformers==4.40.0. For the response generation of GEMMA models at the training stage, we use vllm==0.5.4 with flashinfer backend for CUDA 12.4 and torch 2.4. For evolving prompts, we use distilabel==1.3.2, and use LiteLLM to serve Gemini (default to be gemini-1.5-pro) and transformers models (default to be gemma-2-9b-it).' and 'As authors, we will be committed to supporting these efforts by sharing our findings and implementations to promote open and responsible research and development.' However, it does not provide a direct link to the source code for the methodology described in this paper, nor an explicit statement of its release. |
| Open Datasets | Yes | We use UltraFeedback (Cui et al., 2023) as the training dataset, which contains diverse high-quality prompts that are primarily human-generated. We use the instruction-finetuned GEMMA-2-9B (Team et al., 2024b) as the base (θ₀)... We use: (i) AlpacaEval 2.0 (Dubois et al., 2024), which assesses general instruction following with 805 questions; (ii) MT-Bench (Zheng et al., 2023), which evaluates multi-turn instruction following with 80 hard questions in 8 categories; (iii) Arena-Hard (Li et al., 2024), which is derived from 200K user queries on Chatbot Arena with 500 challenging prompts across 250 topics. |
| Dataset Splits | Yes | Unless stated otherwise, each iteration uses 10K prompts (the initial prompt set), referred to as 1x. In offline RLHF, we denote θ_{t→t+1} as the one trained with new human prompts from the t-th checkpoint; θ_{t→t} denotes the one trained with evolved prompts from the t-th checkpoint without any new human prompts. In online RLHF, training is a continual iteration and θ_{0→1} (nx) denotes training with 10n K prompts in total, mixed and evolved from the initial. ... For evolving prompts (e.g., evolving X₁ to X′₁), with the calculated informativeness metric for each prompt, we normalize them as the weight to do weighted sampling for a 25% informative subset to get X₁^info. We then iterate over X₁^info and call Evol-Instruct (Xu et al., 2023) as the plug-in evolving method (with the number of evolutions set to 4) ... Next we uniformly select 80% prompts from this evolved dataset and 20% from the original dataset (i.e., the buffer) to form X′₁. |
| Hardware Specification | Yes | All experiments are conducted on 8x NVIDIA H100 SXM GPUs. |
| Software Dependencies | Yes | Our codebase primarily relies on transformers==4.40.0. For the response generation of GEMMA models at the training stage, we use vllm==0.5.4 with flashinfer backend for CUDA 12.4 and torch 2.4. For evolving prompts, we use distilabel==1.3.2, and use LiteLLM to serve Gemini (default to be gemini-1.5-pro) and transformers models (default to be gemma-2-9b-it). For evaluation on all benchmarks, we use sglang==0.2.10 and openai==1.35.14, with gpt-4-1106-preview as the judge model and gpt-4-0314-preview as the baseline model. |
| Experiment Setup | Yes | Hyperparameter settings. We follow the original hyperparameter settings as in (Hong et al., 2024; Meng et al., 2024; Wu et al., 2024), default to be: DPO: lr 5e-7, cosine scheduler, β=0.05, 2 epochs/iter, warmup ratio 0.1, effective batch size 8, max length 2048, max prompt length 1024, AdamW. ORPO: lr 5e-7, cosine scheduler, λ=0.5, 1 epoch/iter, warmup ratio 0.1, effective batch size 8, max length 2048, max prompt length 1024, AdamW. SimPO: lr 8e-7, cosine scheduler, β=10, γ=5, 1 epoch/iter, warmup ratio 0.1, effective batch size 32, max length 2048, max prompt length 1024, AdamW. SPPO: lr 5e-7, linear scheduler, β=0.001, 6 epochs/iter, warmup ratio 0.1, effective batch size 8, max length 1024, max prompt length 512, RMSprop. For the solver we generate 6 responses per prompt. We use 42 as the random seed. |
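The creator step of Algorithm 1, together with the evolving procedure quoted under Dataset Splits (weighted sampling of a 25% informative subset, 4 evolutions per prompt, then an 80%/20% evolved-vs-buffer mix), can be sketched as follows. This is a toy sketch, not the authors' implementation: `estimate_info` and `evolve` are hypothetical stand-ins for the paper's reward-based informativeness metric and for Evol-Instruct.

```python
import random

def estimate_info(prompt):
    """Hypothetical stand-in for the paper's informativeness metric."""
    return (len(prompt) % 5) + 1

def evolve(prompts, num_evolutions=4):
    """Hypothetical stand-in for Evol-Instruct: tag each prompt per evolution."""
    evolved = []
    for p in prompts:
        evolved.extend(f"{p} [evolved-{k}]" for k in range(1, num_evolutions + 1))
    return evolved

def creator_step(prompts, info_fraction=0.25, evolved_mix=0.8, seed=42):
    """One creator step: weight-sample an informative subset, evolve it,
    then mix 80% evolved prompts with 20% from the original buffer."""
    rng = random.Random(seed)
    # Normalize informativeness scores into sampling weights.
    weights = [estimate_info(p) for p in prompts]
    total = sum(weights)
    probs = [w / total for w in weights]
    k = max(1, int(info_fraction * len(prompts)))
    info_subset = rng.choices(prompts, weights=probs, k=k)
    evolved = evolve(info_subset)
    # 80% from the evolved set, 20% from the original prompt buffer.
    n_evolved = int(evolved_mix * len(prompts))
    n_buffer = len(prompts) - n_evolved
    return rng.choices(evolved, k=n_evolved) + rng.sample(prompts, n_buffer)

prompts = [f"prompt {i}" for i in range(10)]
next_prompts = creator_step(prompts)
print(len(next_prompts))  # the next prompt set keeps the same size, here 10
```

The solver step would then generate 6 responses per evolved prompt with the current policy, annotate rewards, and run the chosen preference-optimization update (DPO, ORPO, SimPO, or SPPO).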
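As a reading aid, the per-method hyperparameters quoted above can be collected into a config dict. The values are those reported in the paper; key names are my own, and "/" entries (parameters unused by a method) are simply omitted.

```python
# Per-method preference-optimization hyperparameters from the paper's table.
HPARAMS = {
    "DPO":   dict(lr=5e-7, scheduler="cosine", beta=0.05, epochs_per_iter=2,
                  warmup_ratio=0.1, batch_size=8, max_length=2048,
                  max_prompt_length=1024, optimizer="adamw"),
    "ORPO":  dict(lr=5e-7, scheduler="cosine", lam=0.5, epochs_per_iter=1,
                  warmup_ratio=0.1, batch_size=8, max_length=2048,
                  max_prompt_length=1024, optimizer="adamw"),
    "SimPO": dict(lr=8e-7, scheduler="cosine", beta=10, gamma=5,
                  epochs_per_iter=1, warmup_ratio=0.1, batch_size=32,
                  max_length=2048, max_prompt_length=1024, optimizer="adamw"),
    "SPPO":  dict(lr=5e-7, scheduler="linear", beta=0.001, epochs_per_iter=6,
                  warmup_ratio=0.1, batch_size=8, max_length=1024,
                  max_prompt_length=512, optimizer="rmsprop"),
}
NUM_RESPONSES_PER_PROMPT = 6  # solver generates 6 responses per prompt
SEED = 42                     # random seed used throughout
```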