Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ToolRL: Reward is All Tool Learning Needs

Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru WANG, Xiusi Chen, Dilek Hakkani-Tur, Gokhan Tur, Heng Ji

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. All the codes are released to facilitate future research.
Researcher Affiliation Academia University of Illinois Urbana-Champaign EMAIL
Pseudocode Yes C Algorithm Details. We employ GRPO as our standard RL training setting, a variant of PPO that introduces advantage normalization within grouped samples. This normalization helps stabilize training by reducing variance across samples that share a common input context. We continue to use the symbols defined in Section 2 and further let $\pi_\theta$ represent the current policy. Normalized Advantage Across Query Groups. For each query $Q$, its responses derived from the rollout form a group $G_Q$ consisting of multiple responses and their corresponding reward values: $G_Q = \{A, (s_1, r_1), (s_2, r_2), \dots, (s_n, r_n)\}$, where $A$ denotes the ground-truth annotation for $Q$, and each reward $r_i$ is computed as the sum of the format and correctness rewards associated with response $s_i$, i.e., $r_i = R_{\text{format}}(s_i, A) + R_{\text{correct}}(s_i, A)$. For each group, we calculate the mean and standard deviation of the rewards: $\mu_Q = \frac{1}{n} \sum_{i=1}^{n} r_i$, $\sigma_Q = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (r_i - \mu_Q)^2}$. Then, for each sample $s_i$ in the group, we define the normalized advantage: $A_i(s_i \mid Q) = \frac{r_i - \mu_Q}{\sigma_Q + \eta}$, where $\eta$ is a constant to avoid division by zero. Policy Optimization Objective. The policy $\pi_\theta$ is optimized using the standard clipped PPO objective, adapted with our group-wise normalized advantages: $J_{\text{GRPO}}(\theta) = \mathbb{E}_{Q \sim D}\, \mathbb{E}_{s_i \sim \pi_\theta} \left[ \min\left( \frac{\pi_\theta(s_i \mid Q)}{\pi_{\text{old}}(s_i \mid Q)} A_i(s_i \mid Q),\ \text{clip}\left( \frac{\pi_\theta(s_i \mid Q)}{\pi_{\text{old}}(s_i \mid Q)},\ 1 - \epsilon,\ 1 + \epsilon \right) A_i(s_i \mid Q) \right) \right]$. Overall, this objective guides the policy to generate structurally consistent and semantically accurate tool calls, while group-wise normalization mitigates reward variance across queries, leading to more stable and sample-efficient alignment with task-specific response requirements.
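The group-wise normalization and clipped objective quoted above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the defaults `eta=1e-6` and `eps=0.2` are assumed values that the quoted text does not specify, and the probability ratio is taken over whole-response log-probabilities for simplicity, whereas practical trainers typically work per token.

```python
import math


def group_normalized_advantages(rewards, eta=1e-6):
    """Return (r_i - mu_Q) / (sigma_Q + eta) for one query group.

    `eta` is an assumed small constant; the source only says it
    prevents division by zero.
    """
    n = len(rewards)
    mu = sum(rewards) / n
    # Population standard deviation over the group (1/n normalization).
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (sigma + eta) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    `logp_new` / `logp_old` are response-level log-probabilities under
    the current and old policies (a simplifying assumption).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Example: four rollouts for one query, matching "Number of Rollouts 4".
rewards = [1.0, 2.0, 3.0, 4.0]
advantages = group_normalized_advantages(rewards)
objective = clipped_surrogate([0.0] * 4, [0.0] * 4, advantages)
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no gradient signal under this formulation.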
Open Source Code Yes All the codes are released to facilitate future research. Footnote 1: Data and codes released at https://github.com/qiancheng0/ToolRL
Open Datasets Yes To support robust tool learning, we construct a mixed dataset spanning diverse tool use scenarios: ToolACE [27]: A general tool use dataset... Hammer (Masked) [25]: A subset of Hammer with randomized tool and parameter names... xLAM [61]: A compositional dataset...
Dataset Splits No The paper states: "Empirically, we sample 2K examples from ToolACE and 1K each from Hammer and xLAM, creating a balanced dataset spanning diverse levels of complexity and tool use." This describes the construction of the training dataset but does not provide explicit training/test/validation splits for either this combined dataset or the evaluation benchmarks (BFCL, API-Bank, Bamboogle) in terms of percentages or sample counts. While it mentions evaluation on these benchmarks, it does not specify their splits for reproducibility.
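The sampling recipe quoted above (2K from ToolACE, 1K each from Hammer and xLAM) could be sketched as follows. This only illustrates the stated quotas; the helper name, the seed, and the in-memory list representation of each source are assumptions, and the paper's actual sampling procedure is not described in this summary.

```python
import random

# Per-source quotas taken from the quoted sentence (4K examples in total).
QUOTAS = {"ToolACE": 2000, "Hammer": 1000, "xLAM": 1000}


def build_mixed_dataset(sources, quotas, seed=0):
    """Draw `quotas[name]` examples without replacement from each source.

    `sources` maps a dataset name to a list of examples; how those lists
    are loaded is left open here (hypothetical interface).
    """
    rng = random.Random(seed)  # fixed seed so the mix is repeatable
    mixed = []
    for name, quota in quotas.items():
        mixed.extend(rng.sample(sources[name], quota))
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed, as assumed here, is one way to make such a training mix repeatable across runs.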
Hardware Specification Yes For the GRPO training, we use 2 A100 (80G) GPUs per run with the hyper-parameters shown in Table 9.
Software Dependencies No The paper mentions using the "veRL framework [46]" but does not provide a specific version number for this framework or any other software components like programming languages (e.g., Python), libraries (e.g., PyTorch), or CUDA versions.
Experiment Setup Yes Training. We conduct all RL experiments using the veRL framework [46]. For each training step, we sample a batch of 512, and generate 4 responses per query, training for 15 epochs in total (see Appendix E for full configuration). To encourage policy exploration, we remove KL regularization and apply temperature 1.0. We initialize our models with the Qwen-2.5-Instruct [49] and Llama-3.2-Instruct [11] series, which are further tuned under our customized reward design. Table 9: Full configuration for GRPO training. Data Configuration: Train Batch Size 512; Validation Batch Size 128; Max Prompt Length 2048; Max Response Length 1024. Optimization: Learning Rate 1e-6; PPO Mini Batch Size 128; KL Loss Used False. Rollout Configuration: Rollout Name vllm; GPU Memory Utilization 0.6; Number of Rollouts 4. Training & Logging: Save Frequency (Steps) 15; Test Frequency (Steps) 5; Total Epochs 15.
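For readers who want the Table 9 values in machine-readable form, the sketch below collects them in a small dataclass. The class and field names are hypothetical and are not the veRL framework's configuration keys; only the values come from the text above. Two derived quantities follow from simple arithmetic on those values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOConfig:
    """Values from Table 9; field names are illustrative, not veRL keys."""

    # Data configuration
    train_batch_size: int = 512
    val_batch_size: int = 128
    max_prompt_length: int = 2048
    max_response_length: int = 1024
    # Optimization
    learning_rate: float = 1e-6
    ppo_mini_batch_size: int = 128
    use_kl_loss: bool = False
    # Rollout configuration
    rollout_name: str = "vllm"
    gpu_memory_utilization: float = 0.6
    num_rollouts: int = 4
    temperature: float = 1.0  # stated in the training description
    # Training and logging
    save_freq_steps: int = 15
    test_freq_steps: int = 5
    total_epochs: int = 15

    def responses_per_step(self) -> int:
        # 512 queries per step, 4 sampled responses per query.
        return self.train_batch_size * self.num_rollouts

    def max_sequence_length(self) -> int:
        # Upper bound on prompt plus response tokens.
        return self.max_prompt_length + self.max_response_length


config = GRPOConfig()
```

With these values, each training step generates 512 × 4 = 2048 responses, and a prompt plus its response spans at most 2048 + 1024 = 3072 tokens.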