AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
Authors: Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that AGENTVERSE can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AGENTVERSE. |
| Researcher Affiliation | Collaboration | Weize Chen1, Yusheng Su1, Jingwei Zuo1, Cheng Yang2, Chenfei Yuan1, Chi-Min Chan1, Heyang Yu1, Yaxi Lu1, Yi-Hsin Hung1, Chen Qian1, Yujia Qin1, Xin Cong1, Ruobing Xie3, Zhiyuan Liu1, Maosong Sun1, Jie Zhou3. 1 Tsinghua University; 2 Beijing University of Posts and Telecommunications; 3 Pattern Recognition Center, WeChat AI, Tencent Inc. |
| Pseudocode | No | The paper includes Python code examples (Figures 13 and 14) but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We will release our codebase, AGENTVERSE, to further facilitate multi-agent research. |
| Open Datasets | Yes | To assess the agents' general understanding and reasoning capabilities, we use four datasets: FED (Mehri & Eskenazi, 2020), Commongen Challenge (Madaan et al., 2023), MGSM (Shi et al., 2023), and Logic Grid Puzzles (Srivastava et al., 2022). |
| Dataset Splits | No | The paper lists the datasets used (FED, Commongen Challenge, MGSM, Logic Grid Puzzles, Humaneval) but does not provide explicit details on training, validation, or testing splits (e.g., percentages, sample counts, or specific references to predefined splits) beyond stating they are used for evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models, memory, or cloud resources) used to run its experiments. |
| Software Dependencies | Yes | In all the experiments, we evaluate the performance of agents driven by GPT-3.5-Turbo-0613 and GPT-4-0613 across various tasks. [...] We employ the checkpoint available in the official repository, and use GPT-4-0314 as the backbone LLM for the Voyager agent to be consistent with Wang et al. (2023a). |
| Experiment Setup | Yes | All the experiments are done in a zero-shot setting. For tasks including dialogue response, code completion, and constrained generation, four agents are recruited into the system. For the task of mathematical reasoning, we limited the number to two agents. [...] For tool utilization, we recruit two or three agents to engage in collaborative decision-making and action execution, depending on the specific task. For tasks in coding and general understanding and reasoning, we use the vertical structure because all these tasks require only one response as the answer, and the solver in the vertical structure can be responsible for answering. For tool utilization, we use the horizontal structure because the agents should clarify their own sub-tasks in the discussion. [...] The process concludes either when the agent finalizes its execution with its conclusion or after a pre-set maximum number of iterations (10 in our experiments). These settings are summarized in the configuration sketch below the table. |
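
The per-task settings quoted in the Experiment Setup and Software Dependencies rows can be summarized as a small, self-contained configuration script. The sketch below is an illustrative reconstruction only: the `TaskSetting` dataclass, its field names, and the task labels are our own assumptions for readability and are not the configuration schema of the released AgentVerse codebase.

```python
from dataclasses import dataclass


@dataclass
class TaskSetting:
    task: str                 # task family named in the Experiment Setup quote
    num_agents: int           # agents recruited into the multi-agent group
    structure: str            # "vertical": a single solver gives the final answer;
                              # "horizontal": agents negotiate their own sub-tasks
    max_iterations: int = 10  # pre-set cap on collaboration rounds (10 in the paper)


# Backbone checkpoints named in the Software Dependencies row; all runs are zero-shot.
BACKBONES = ["gpt-3.5-turbo-0613", "gpt-4-0613"]

SETTINGS = [
    TaskSetting("dialogue_response", 4, "vertical"),
    TaskSetting("code_completion", 4, "vertical"),
    TaskSetting("constrained_generation", 4, "vertical"),
    TaskSetting("mathematical_reasoning", 2, "vertical"),
    # Tool utilization recruits two or three agents, depending on the specific task.
    TaskSetting("tool_utilization", 3, "horizontal"),
]

if __name__ == "__main__":
    for backbone in BACKBONES:
        for s in SETTINGS:
            print(f"{backbone:20} {s.task:24} agents={s.num_agents} "
                  f"structure={s.structure} max_iter={s.max_iterations}")
```

Running the script prints one line per (backbone, task) pair, which makes it easy to cross-check the reported agent counts and communication structures against the quotes above.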