Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DPO Meets PPO: Reinforced Token Optimization for RLHF

Authors: Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. LLM Response: "Extensive experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO." (Section 5, Experiments; Section 5.1, Benchmark Results): "We present a thorough comparison of RTO with PPO and other widely used direct preference learning algorithms on popular benchmarks to highlight RTO's strong performance."
Researcher Affiliation: Collaboration. LLM Response: "(1) Center for Data Science, Peking University; (2) State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (3) University of Illinois Urbana-Champaign; (4) Microsoft Research Asia. Correspondence to: Han Zhong <EMAIL>, Li Zhao <EMAIL>, Di He <EMAIL>, Liwei Wang <EMAIL>."
Pseudocode: Yes. LLM Response: "Algorithm 1: Reinforced Token Optimization (Theoretical Version); Algorithm 2: Reinforced Token Optimization (Practical Version)."
Open Source Code: Yes. LLM Response: "Our code and models are available at https://github.com/zkshan2002/RTO."
Open Datasets: Yes. LLM Response: "To assess the overall quality of generated text responses across multiple dimensions (e.g., helpfulness, accuracy, and clarity), we employ the UltraFeedback dataset (Cui et al., 2023), which contains comprehensive human feedback annotations on model outputs. We also consider the summarization task (Völske et al., 2017), where the model is required to generate a concise summary of a given post from the Reddit forum. Specifically, we fine-tune the foundation model on the Reddit TL;DR summarization dataset (Völske et al., 2017)."
Dataset Splits: No. LLM Response: The paper mentions using "a binarized version of the UltraFeedback dataset, while all reinforcement learning uses a prompt-only version" and notes that "We conducted experiments using only a fraction of the full dataset." For the summarization task, it states "For each GPT-4 evaluation, we use 100 samples." However, it does not specify explicit training/validation/test splits (e.g., percentages or counts) for the primary datasets used in model training and evaluation.
Hardware Specification: Yes. LLM Response: "Our experiments are conducted on 8 80GB A100 GPUs."
Software Dependencies: No. LLM Response: The paper mentions using the Adam optimizer (Kingma & Ba, 2017) and notes that its code is based on the OpenRLHF repository (Hu et al., 2024). It also mentions using the Llama-3-8B and OpenRLHF/Llama-3-8b-sft-mixture models. However, specific version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key libraries are not provided.
Experiment Setup: Yes. LLM Response: "D. Implementation Details. Training hyperparameters: We use the Adam optimizer (Kingma & Ba, 2017) across all experiments with varying learning rates, betas of (0.9, 0.95), and no weight decay. We apply a cosine learning-rate schedule with 3% warmup steps and a 10% minimum learning rate. All experiments use a single epoch, since we do not observe much gain from further training. Additionally, we set the maximum sequence length to 2048. We include all other method-specific hyperparameters in the tables below." Tables throughout Appendix D (e.g., Reward Model, DPO, PPO, RTO, and Baselines on UltraFeedback; SFT, DPO, PPO, RTO, and DPPO on TL;DR) provide specific hyperparameters such as learning rate, batch size, maximum sequence length, KL coefficient, PPO update steps, PPO clip coefficient, GAE λ, DPO reward rescale, reward rescale, and epochs.
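The schedule quoted above can be made concrete. The following is a minimal sketch, not the authors' code, assuming the usual reading of their description: linear warmup over the first 3% of steps, then cosine decay down to a floor of 10% of the peak learning rate. The function name and the interpretation of "10% minimum learning rate" as a fraction of the peak are assumptions.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr,
                          warmup_frac=0.03, min_lr_frac=0.10):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay.

    warmup_frac=0.03 and min_lr_frac=0.10 mirror the paper's stated
    "3% warmup steps" and "10% minimum learning rate".
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    min_lr = peak_lr * min_lr_frac
    if step < warmup_steps:
        # Linear warmup up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` around an `Adam(params, lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.0)` optimizer, matching the betas and zero weight decay reported in Appendix D.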