Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Authors: Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models.
Researcher Affiliation | Collaboration | Yapei Chang (University of Maryland, College Park); Yekyung Kim (University of Maryland, College Park); Michael Krumdick (Kensho); Amir Zadeh (Lambda AI); Chuan Li (Lambda AI); Chris Tanner (Kensho); Mohit Iyyer (University of Maryland, College Park)
Pseudocode | No | The paper describes the GRPO training method and BLEUBERI method in prose within Section 3, but does not present them as structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code and data at https://github.com/lilakk/BLEUBERI.
Open Datasets | Yes | In experiments on general instruction-following tasks in the LMSYS chatbot_arena_conversations dataset, BLEU with five synthetic references achieves almost the same agreement (74.2%) with human preferences as a powerful 27B-parameter reward model (75.6%). We draw from the Tulu3 SFT mixture [24], which contains 939K examples across 18 data sources covering diverse tasks. We evaluate our models using four benchmarks: MT-Bench [79], a set of 80 manually curated, high-quality multi-turn questions; ArenaHard v1 and v2 [26], two distinct sets of 500 challenging prompts drawn from real-world user queries. ... and WildBench [28], comprising 1,024 complex real-world queries.
Dataset Splits | Yes | To ensure consistency in our main experiments, we train all methods on the 5,000 hardest examples as ranked by BLEU. We sample 120 examples (30 from each benchmark) and asked two annotators to compare the outputs from the BLEU-trained and RM-trained models, denoted O_B and O_R respectively.
Hardware Specification | Yes | All training runs are performed on single GH200 GPUs using TRL with DeepSpeed-ZeRO3 (https://www.deepspeed.ai/2021/03/07/zero3-offload.html).
Software Dependencies | No | All training runs are performed on single GH200 GPUs using TRL with DeepSpeed-ZeRO3 (https://www.deepspeed.ai/2021/03/07/zero3-offload.html). In our experiments, we use the huggingface implementation with tokenizer_13a, and we apply smoothing to prevent zero scores for higher-order n-gram precisions when no matches are found.
Experiment Setup | Yes | For all methods, we train for one full epoch on the BLEU-selected 5K data. For SFT, we use a learning rate of 5e-6 with a global batch size of 32 and set max tokens (covering both input and output) to 1024. For GRPO, we set the learning rate to 1e-6, group size of 8, max prompt length and max generation length to 512 tokens, and maintain the same global batch size of 32, meaning each batch consists of 8 generations for each of 4 unique prompts.
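
The batch arithmetic in the Experiment Setup row, and the "5,000 hardest examples as ranked by BLEU" selection in the Dataset Splits row, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the class and function names (`GRPOSetup`, `prompts_per_batch`, `batches_per_epoch`, `select_hardest`) are hypothetical, "hardest" is assumed to mean lowest BLEU score, and the batches-per-epoch figure assumes each prompt is used exactly once per epoch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class GRPOSetup:
    """GRPO values quoted in the Experiment Setup row (field names are illustrative)."""
    num_train_examples: int = 5000    # BLEU-selected 5K training set
    global_batch_size: int = 32       # total generations per batch
    group_size: int = 8               # generations sampled per prompt
    learning_rate: float = 1e-6
    max_prompt_length: int = 512      # tokens
    max_generation_length: int = 512  # tokens


def prompts_per_batch(cfg: GRPOSetup) -> int:
    """Number of unique prompts in one batch: global batch size / group size."""
    if cfg.global_batch_size % cfg.group_size != 0:
        raise ValueError("global batch size must be a multiple of the group size")
    return cfg.global_batch_size // cfg.group_size


def batches_per_epoch(cfg: GRPOSetup) -> int:
    """Batches needed to see every prompt once (assumption: one pass, no reuse)."""
    per_batch = prompts_per_batch(cfg)
    # Ceiling division so a final partial batch is still counted.
    return -(-cfg.num_train_examples // per_batch)


def select_hardest(bleu_scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k examples with the lowest BLEU score.

    Assumption: "hardest" means the lowest-scoring examples under BLEU.
    """
    order = sorted(range(len(bleu_scores)), key=lambda i: bleu_scores[i])
    return order[:k]


if __name__ == "__main__":
    cfg = GRPOSetup()
    print(prompts_per_batch(cfg))   # 32 / 8 unique prompts per batch
    print(batches_per_epoch(cfg))   # 5000 prompts / prompts per batch
    print(select_hardest([0.9, 0.1, 0.5, 0.3], k=2))
```

With the quoted values, `prompts_per_batch` gives 4, which matches the "8 generations for each of 4 unique prompts" statement in the row. Under the one-pass assumption, a single epoch over 5,000 prompts corresponds to 1,250 batches.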