Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ARIA: Training Language Agents with Intention-driven Reward Aggregation

Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Jiangjie Chen, Siyu Yuan, Deqing Yang, Yanghua Xiao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
Researcher Affiliation	Collaboration	1, School of Data Science, Fudan University, Fudan University ByteDance Seed 2, College of Computer Science and Artificial Intelligence, Fudan University EMAIL EMAIL {jiangjiec}@bytedance.com
Pseudocode	Yes	We illustrate the process of the Hierarchical Agglomerative Clustering (HAC) algorithm in Algorithm 1. Algorithm 1 Hierarchical Agglomerative Clustering (HAC) with Average Linkage
Open Source Code	No	The paper lists a "Project Page: https://aria-agent.github.io" but does not contain an explicit statement of code release for the methodology described in the paper, nor a direct link to a code repository.
Open Datasets	Yes	We evaluate ARIA in both single-agent and adversarial environments (see Appendix H for details). For the single-agent setting, we consider two tasks: 1) Twenty Questions [8], a dialogue task... 2) Guess My City [8], a similar multi-turn task... For the adversarial setting, we consider two competitive tasks: 1) Bargaining [43]... 2) Negotiation [43].
Dataset Splits	Yes	For offline methods, we collect 1,000 trajectories in the single-agent scenario and 2,000 trajectories in the adversarial scenario... For the single-agent environments... We set N = 200 for each environment. For the adversarial environments... with each setting repeated N = 25 times.
Hardware Specification	Yes	All experiments are conducted using 8 NVIDIA A100-80GB GPUs.
Software Dependencies	Yes	Models: We use Llama-3-8B-Instruct [44] as the policy model. For each language action, we obtain its semantic embedding using text-embedding-3-small [45]. Additional ablation results on alternative embedding models are provided in Appendix K. In single-agent environments, Oracle is simulated with GPT-4. In adversarial settings, we employ opponent models from different families, including GPT-4o (gpt-4o-2024-08-06) [46], Claude 3 (claude-3-5-sonnet-20240620) [47], and DeepSeek-Chat (DeepSeek-V3) [48].
Experiment Setup	Yes	The hyperparameter configurations for all experiments are detailed in Table 4. Table 4: Hyperparameters for All Experiments, Adversarial, Single-Agent: actor lr, batch size, number of epoch, cutoff length, kl coefficient, rollout trajectories, replay buffer size, critic lr, critic updates per iteration, actor updates per iteration, warm up iters with no actor update, iteration, group size, update per, cutoff length.
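
Illustrative note on the Pseudocode entry: the paper's Algorithm 1 is described as Hierarchical Agglomerative Clustering (HAC) with average linkage, applied to embeddings of language actions. The sketch below shows the general technique only and is not the authors' implementation. The function names, the Euclidean distance metric, the stopping rule (a fixed number of clusters), and the toy 2-D "embeddings" are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices.

    Euclidean distance is an assumption; the paper may use another metric.
    """
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def hac_average(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    target = max(num_clusters, 1)  # at least one cluster must remain
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > target:
        # Pick the pair of clusters with the smallest average-linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical toy data standing in for action embeddings.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
result = hac_average(toy_embeddings, 3)
```

With the toy data above, the two nearby pairs are merged and the isolated point stays in its own cluster. This naive version recomputes all pairwise cluster distances at every merge, so it is only suitable for small inputs; a library routine would be preferable for real embedding sets.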