Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Behavior Injection: Preparing Language Models for Reinforcement Learning

Authors: Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model. ... We evaluate BRIDGE across diverse tasks from iGSM and PromptBench. Extensive experiments and ablation studies demonstrate that BRIDGE enhances data co-influence and significantly improves performance in the RL stage.
Researcher Affiliation | Collaboration | Zhepeng Cen¹, Yihang Yao¹, William Han¹, Zuxin Liu², Ding Zhao¹. ¹Carnegie Mellon University, ²Salesforce AI Research
Pseudocode | Yes | Algorithm 1: BRIDGE. Input: Vanilla QA pairs (q, a), injection prob p. Output: Injected QA pairs (q, a′). 1: Extract DAG G = (V, E) from (q, a). 2: # Construct exploration behaviors. 3: Obtain a locked node n_l ahead. 4: b1 = "Let's solve [n_l], ..., wait, [n_l] seems to be not solvable yet, let's get back." 5: # Construct exploitation behaviors. 6: Aggregate all the information info to solve an unlocked but not solved node n_u. 7: b2 = "Let's solve [n_u], ..., since [info], [n_u] = [subgoal computation] = [n_u.value]" 8: a′ ← a + b1 + b2; # Inject with prob p. 9: Return: Injected QA pair (q, a′)
Open Source Code | Yes | Answer: [Yes] Justification: The code and dataset are provided in the supplementary material.
Open Datasets | Yes | Tasks. To evaluate the performance growth during RL finetuning, we conduct experiments on two benchmarks: 1) iGSM [74], a grade-school math problem benchmark involving math and common sense reasoning tasks; 2) PromptBench [86], a benchmark involving arithmetic and logical reasoning tasks. ... We use VeRL with Apache-2.0 License, iGSM with MIT License, and PromptBench with MIT License.
Dataset Splits | Yes | In iGSM task, we use data with 15-20 operations for finetuning. ... We train each model on corresponding dataset for 5 epochs. In the RL stage, we train on the data with the same difficulty. ... we test their performance on two problem sets separately: 1) in-distribution (In-Dist) set with operation number = 20 and 2) out-of-distribution (OOD) set with operation number = 25. Each set consists of 500 problems. ... For the PromptBench task, we use data with reasoning depth = 4 and 5 for SFT and RL training respectively, both of which have 0-8 redundant premises. ... We also test the performances on 1) in-distribution set with reasoning depth = 5 and 0-8 redundancy and 2) out-of-distribution set with reasoning depth = 5 and 20-22 redundancy separately.
Hardware Specification | Yes | All experiments can be run on a server with 2 A100 (80G). Each SFT experiment takes less than 1h. Regarding the RL experiments, it takes 8h for Qwen-1.5B and Llama-1B to complete 200-step RL and 6h for Qwen-3B to run 100-step RL on iGSM while it spends 2h to run RL on PromptBench (100 steps) for Qwen-1.5B and Llama-1B due to shorter rollout length.
Software Dependencies | No | Our implementation is based on VeRL [100]. When computing KL divergence, we use the low variance implementation [101], aligning with the GRPO implementation [23]. ... rollout backend vllm [102]
Experiment Setup | Yes | Table 3: Configurations of SFT training (configuration: value). training epoch: 5; batch size: 128; learning rate: 5 × 10⁻⁶; learning rate scheduler: constant ... Table 4: Configurations of RL training (configuration: value). batch size: 256; learning rate: 1 × 10⁻⁶; learning rate scheduler: constant; rollout number per query N: 32; rollout temperature: 1.0; rollout backend: vllm [102]; KL coefficient β: 0.001
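
The injection procedure quoted in the Pseudocode row (Algorithm 1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` type, its `expr` field, the per-step injection loop, and the function name `inject_behaviors` are hypothetical, and the DAG is assumed to have been extracted from (q, a) already, with nodes supplied in a valid solution order.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    deps: tuple   # names of the nodes this value depends on
    expr: str     # readable subgoal computation (hypothetical field)
    value: int


def inject_behaviors(nodes, p, rng):
    """Build an answer string, injecting b1 and b2 with probability p per step.

    `nodes` must be listed in topological order, so the node being solved
    is always unlocked (all dependencies solved) but not yet solved.
    """
    by_name = {n.name: n for n in nodes}
    solved = set()
    parts = []
    for node in nodes:
        if rng.random() < p:
            # Exploration (b1): try a locked node, i.e. one with an unsolved dependency.
            locked = [n for n in nodes
                      if n.name not in solved and any(d not in solved for d in n.deps)]
            if locked:
                nl = rng.choice(locked)
                parts.append(f"Let's solve {nl.name}, ..., wait, {nl.name} seems to be "
                             f"not solvable yet, let's get back.")
            # Exploitation (b2): aggregate known values, then compute the subgoal.
            info = ", ".join(f"{d} = {by_name[d].value}" for d in node.deps) or "it is given"
            parts.append(f"Let's solve {node.name}, ..., since {info}, "
                         f"{node.name} = {node.expr} = {node.value}")
        else:
            # Vanilla step without injected behaviors.
            parts.append(f"{node.name} = {node.expr} = {node.value}")
        solved.add(node.name)
    return " ".join(parts)


if __name__ == "__main__":
    demo = [Node("a", (), "3", 3),
            Node("b", ("a",), "2 * a", 6),
            Node("c", ("a", "b"), "a + b", 9)]
    print(inject_behaviors(demo, p=0.5, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the injected data reproducible across runs; setting `p=0.0` recovers the vanilla answer, which makes it easy to compare injected and non-injected SFT data.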
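
The Software Dependencies row mentions a low variance KL implementation aligned with GRPO. Assuming this refers to the commonly used "k3" estimator, exp(d) - d - 1 with d = log π_ref - log π_θ evaluated on tokens sampled from π_θ, a per-token version could look like the sketch below. The function names, the clamping bounds, and the pure-Python form are illustrative assumptions; only β = 0.001 is taken from Table 4.

```python
import math

KL_COEF = 0.001  # KL coefficient beta from Table 4


def low_var_kl(logprob, ref_logprob, clip=10.0):
    """Per-token k3 estimate of KL(pi_theta || pi_ref).

    `logprob` and `ref_logprob` are the log-probabilities that the policy and
    the reference model assign to the same sampled token.
    """
    log_ratio = ref_logprob - logprob                 # log(pi_ref / pi_theta)
    log_ratio = max(-20.0, min(20.0, log_ratio))      # assumed guard against overflow in exp
    estimate = math.exp(log_ratio) - log_ratio - 1.0  # non-negative, zero when the models agree
    return max(-clip, min(clip, estimate))            # assumed clipping for numerical stability


def kl_penalty(logprobs, ref_logprobs, beta=KL_COEF):
    """Mean per-token KL estimate scaled by beta (illustrative penalty term)."""
    terms = [low_var_kl(lp, rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
    return beta * sum(terms) / len(terms)
```

Unlike the naive estimate log π_θ - log π_ref, each term here is non-negative, which is the usual reason given for its lower variance.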