Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ARIA: Training Language Agents with Intention-driven Reward Aggregation

Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Jiangjie Chen, Siyu Yuan, Deqing Yang, Yanghua Xiao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
Researcher Affiliation | Collaboration | 1, School of Data Science, Fudan University, Fudan University ByteDance Seed 2, College of Computer Science and Artificial Intelligence, Fudan University EMAIL EMAIL {jiangjiec}@bytedance.com
Pseudocode | Yes | We illustrate the process of the Hierarchical Agglomerative Clustering (HAC) algorithm in Algorithm 1. Algorithm 1 Hierarchical Agglomerative Clustering (HAC) with Average Linkage
Open Source Code | No | The paper lists a "Project Page: https://aria-agent.github.io" but does not contain an explicit statement of code release for the methodology described in the paper, nor a direct link to a code repository.
Open Datasets | Yes | We evaluate ARIA in both single-agent and adversarial environments (see Appendix H for details). For the single-agent setting, we consider two tasks: 1) Twenty Questions [8], a dialogue task... 2) Guess My City [8], a similar multi-turn task... For the adversarial setting, we consider two competitive tasks: 1) Bargaining [43]... 2) Negotiation [43].
Dataset Splits | Yes | For offline methods, we collect 1,000 trajectories in the single-agent scenario and 2,000 trajectories in the adversarial scenario... For the single-agent environments... We set N = 200 for each environment. For the adversarial environments... with each setting repeated N = 25 times.
Hardware Specification | Yes | All experiments are conducted using 8 NVIDIA A100-80GB GPUs.
Software Dependencies | Yes | Models: We use Llama-3-8B-Instruct [44] as the policy model. For each language action, we obtain its semantic embedding using text-embedding-3-small [45]. Additional ablation results on alternative embedding models are provided in Appendix K. In single-agent environments, Oracle is simulated with GPT-4. In adversarial settings, we employ opponent models from different families, including GPT-4o (gpt-4o-2024-08-06) [46], Claude 3 (claude-3-5-sonnet-20240620) [47], and DeepSeek-Chat (DeepSeek-V3) [48].
Experiment Setup | Yes | The hyperparameter configurations for all experiments are detailed in Table 4. Table 4: Hyperparameters for All Experiments, Adversarial, Single-Agent: actor lr, batch size, number of epoch, cutoff length, kl coefficient, rollout trajectories, replay buffer size, critic lr, critic updates per iteration, actor updates per iteration, warm up iters with no actor update, iteration, group size, update per, cutoff length.
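
The Pseudocode entry above refers to Hierarchical Agglomerative Clustering (HAC) with average linkage. The paper's Algorithm 1 is not reproduced on this page, so the following is only a generic, minimal sketch of that textbook technique, not the authors' implementation. The function names (`euclidean`, `average_linkage`, `hac_average`), the use of Euclidean distance, the fixed target cluster count, and the toy 2-D vectors standing in for action embeddings are all assumptions made for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters.

    c1 and c2 are lists of indices into `points`.
    """
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac_average(points, n_clusters):
    """Naive agglomerative clustering with average linkage.

    Starts from singleton clusters and repeatedly merges the closest pair
    until `n_clusters` remain. Returns clusters as sorted index lists.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                points, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        merged = clusters[a] + clusters[b]
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy stand-ins for embeddings: two tight pairs and one isolated point.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
clusters = hac_average(toy_embeddings, n_clusters=3)
```

This naive version recomputes every pairwise linkage at each merge, which is fine for a handful of points but scales poorly; a library routine would be the usual choice for real embedding sets.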