Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Group-in-Group Policy Optimization for LLM Agent Training

Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present empirical evaluations of GiGPO across a variety of agentic tasks. Specifically, we aim to demonstrate: (1) the strong ability of GiGPO in training LLM agents; (2) the ablation study of GiGPO; (3) the dynamic trend of step-level group G^S(s̃) over the course of training; (4) the computational budget of GiGPO.
Researcher Affiliation | Collaboration | Lang Feng¹, Zhenghai Xue¹, Tingcong Liu¹, Bo An¹,²; ¹Nanyang Technological University, Singapore; ²Skywork AI, Singapore. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 summarizes the full GiGPO training procedure. Compared to vanilla GRPO, we highlight the additional parts introduced by GiGPO in italics. In particular, building step-level groups G^S(s̃) is implemented by treating anchor states as keys and aggregating corresponding data into a hash table, which incurs minimal overhead. Furthermore, computing step relative advantages and combining advantages involve only simple arithmetic operations, both of which are lightweight. As such, GiGPO preserves the critic-free, low-memory, and stable convergence properties of group-based RL, while introducing fine-grained credit assignment that is particularly beneficial for training long-horizon LLM agents.
Open Source Code | Yes | Corresponding author. Code: https://github.com/langfengQ/verl-agent
Open Datasets | Yes | We evaluate GiGPO on challenging agent benchmarks, including ALFWorld [5] and WebShop [22], as well as tool-integrated reasoning on search-augmented QA tasks, including single-hop QA datasets (NQ [62], TriviaQA [63], and PopQA [64]) and multi-hop QA datasets (HotpotQA [65], 2Wiki [66], MuSiQue [67], and Bamboogle [68]).
Dataset Splits | Yes | Hyperparameters for ALFWorld. All methods are configured with identical hyperparameters: ... The rollout group size N for group-based RL methods is set to 8. For search-augmented QA, we follow the same settings in Search-R1 [58]. We set the train data size to 256 and use a group size of 5.
Hardware Specification | Yes | Computing Details. For ALFWorld and WebShop, Qwen2.5-1.5B experiments are run on 2 H100 GPUs and Qwen2.5-7B on 4 H100 GPUs, each for 150 iterations. For search-augmented QA, Qwen2.5-3B uses 4 H100 GPUs and Qwen2.5-7B uses 8 H100 GPUs, each for 200 iterations.
Software Dependencies | No | The paper mentions Qwen2.5-1.5B/3B/7B-Instruct [3], Qwen2.5-VL-3B-Instruct [78], and E5 [70] as a retriever, but does not provide specific version numbers for any software libraries or frameworks. This prevents replication based on exact software environments.
Experiment Setup | Yes | Hyperparameters for ALFWorld. All methods are configured with identical hyperparameters: the maximum prompt length is 2048 tokens, and the maximum response length is 512 tokens. Each episode allows up to 50 environment steps. The learning rate is set to 1e-6 for the actor and 1e-5 for the critic (used only in PPO). We adopt a rule-based reward, assigning a reward of 10 for success and 0 for failure. To handle invalid actions generated by the agent, we apply a reward penalty of -0.1. For all group-based RL methods, we use a group size of 8 and sample 16 different groups per rollout, resulting in a total of 16 × 8 = 128 environments. In contrast, PPO uses 128 separate environments for rollouts. The rollout temperature is set to 1.0, while the validation temperature is set to 0.4. The mini-batch size is 256, and the KL-divergence loss coefficient is set to 0.01. For GiGPO, the weighting coefficient ω is fixed at 1 without further tuning, and the discount factor γ is set to 0.95.
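
Illustrative sketch (not taken from the paper or its repository): the Pseudocode entry above describes step-level groups as a hash table keyed by anchor state, followed by simple arithmetic to compute and combine advantages. The Python below shows one way that idea could look. The function names, the trajectory layout (dicts with "states" and "rewards"), the mean/std normalization, and the use of discounted return-to-go as the step-level signal are all assumptions made for illustration; only ω = 1 and γ = 0.95 come from the Experiment Setup entry.

```python
from collections import defaultdict
from statistics import mean, pstdev

GAMMA = 0.95  # discount factor reported in the Experiment Setup entry
OMEGA = 1.0   # weighting coefficient reported in the Experiment Setup entry


def discounted_returns(rewards, gamma=GAMMA):
    """Discounted return-to-go for every step of one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def normalize(values, eps=1e-6):
    """Group-relative normalization; the mean/std form is an assumption."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(trajectories, omega=OMEGA, gamma=GAMMA):
    """Hypothetical helper: per-step advantages for a group of trajectories.

    Each trajectory is a dict with 'states' (hashable) and 'rewards',
    both lists of the same length.
    """
    # Episode-level signal: total reward per trajectory, normalized in-group.
    episode_adv = normalize([sum(t["rewards"]) for t in trajectories])

    # Step-level groups: hash table keyed by anchor state.
    groups = defaultdict(list)  # state -> [(traj_idx, step_idx, return)]
    for i, traj in enumerate(trajectories):
        rets = discounted_returns(traj["rewards"], gamma)
        for t, (state, ret) in enumerate(zip(traj["states"], rets)):
            groups[state].append((i, t, ret))

    # Step-level signal: normalize returns among steps sharing a state.
    step_adv = [[0.0] * len(t["rewards"]) for t in trajectories]
    for members in groups.values():
        if len(members) < 2:
            continue  # a singleton group carries no relative signal
        normed = normalize([ret for _, _, ret in members])
        for (i, t, _), adv in zip(members, normed):
            step_adv[i][t] = adv

    # Combine with simple arithmetic: episode term plus omega * step term.
    return [
        [episode_adv[i] + omega * a for a in step_adv[i]]
        for i in range(len(trajectories))
    ]
```

In this sketch, two trajectories that start from the same state but diverge afterwards receive different advantages at the shared step, while steps whose state appears only once fall back to the episode-level term alone.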