AutoAgents: A Framework for Automatic Agent Generation
Authors: Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje Karlsson, Jie Fu, Yemin Shi
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents. |
| Researcher Affiliation | Collaboration | Guangyao Chen¹, Siwei Dong¹, Yu Shu¹, Ge Zhang⁴, Jaward Sesay³, Börje Karlsson³, Jie Fu², and Yemin Shi¹. ¹Peking University, ²Hong Kong University of Science and Technology, ³Beijing Academy of Artificial Intelligence, ⁴University of Waterloo |
| Pseudocode | Yes | Algorithm 1: AutoAgents Execution Process. Input: User task/Question. Output: Task solution/Answer. (A hedged sketch of this control flow appears after the table.) |
| Open Source Code | Yes | The repository of this project is available at https://github.com/Link-AGI/AutoAgents. |
| Open Datasets | Yes | We constructed a benchmark consisting of 100 instances for each N, encompassing a total of 1000 trivia questions. Evaluation Metrics. Drawing on the approach of [Wang et al., 2023c], we adopt an automatic metric to identify factual errors and measure a model's capacity to integrate diverse domain knowledge. We conduct string matching with the veridical target answers for each question on the generated output. The target answers are supplied from the TriviaQA dataset [Joshi et al., 2017], and each question can have a list of answer variants. (A sketch of this string-matching metric appears after the table.) |
| Dataset Splits | No | The paper does not explicitly state specific training/validation/test splits for all experiments. While it mentions 'The last 20 samples from a dataset of 100 samples are used as test instances' for the ablation studies, it does not provide comprehensive split details (including validation sets) for all experiments, nor does it reference standard splits with specific sizes. |
| Hardware Specification | No | The paper states, 'All experiments are conducted using the GPT-4 API'. This refers to a software API for an LLM and does not provide any specific details about the underlying hardware (e.g., GPU models, CPU types, memory) used for computations. |
| Software Dependencies | Yes | All experiments are conducted using the GPT-4 API, with the temperature set to 0 to ensure reproducibility. This model is chosen for its superior performance, providing accurate and consistent results. Its accessibility via APIs greatly facilitates our interaction with the model, streamlining our research process. During the drafting phase, a maximum of three discussions are allowed, while in the execution phase, a single agent can perform up to five self-refinements and multiple agents can collaboratively refine up to five times. |
| Experiment Setup | Yes | All experiments are conducted using the GPT-4 API, with the temperature set to 0 to ensure reproducibility. This model is chosen for its superior performance, providing accurate and consistent results. Its accessibility via APIs greatly facilitates our interaction with the model, streamlining our research process. During the drafting phase, a maximum of three discussions are allowed, while in the execution phase, a single agent can perform up to five self-refinements and multiple agents can collaboratively refine up to five times. |
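The Pseudocode and Experiment Setup quotes together give enough structure for a rough sketch of Algorithm 1's control flow: a drafting phase capped at three discussion rounds, followed by an execution phase with up to five single-agent self-refinements and up to five collaborative refinements. The sketch below is an illustration under those quoted caps, not the authors' implementation; the prompts and the `call_llm` callable (e.g., a GPT-4 API call with temperature 0) are stand-ins.

```python
from typing import Callable

# Refinement caps quoted in the paper's experiment setup.
MAX_DRAFT_DISCUSSIONS = 3   # drafting phase: at most three discussions
MAX_SELF_REFINEMENTS = 5    # execution phase: per-agent self-refinements
MAX_COLLAB_REFINEMENTS = 5  # execution phase: collaborative refinements


def autoagents_run(task: str, call_llm: Callable[[str], str]) -> str:
    """Hypothetical sketch: drafting (team/plan generation plus discussion),
    then execution with self-refinement and collaborative refinement."""
    # Drafting phase: propose an agent team and plan, then discuss and revise.
    plan = call_llm(f"Draft an agent team and execution plan for: {task}")
    for _ in range(MAX_DRAFT_DISCUSSIONS):
        critique = call_llm(f"Discuss and critique this plan:\n{plan}")
        plan = call_llm(f"Revise the plan using this critique:\n{critique}\n\nPlan:\n{plan}")

    # Execution phase: draft a solution, then refine within the quoted caps.
    solution = call_llm(f"Execute this plan and draft a solution:\n{plan}")
    for _ in range(MAX_SELF_REFINEMENTS):
        solution = call_llm(f"Self-refine this draft solution:\n{solution}")
    for _ in range(MAX_COLLAB_REFINEMENTS):
        solution = call_llm(f"As the full agent team, collaboratively refine:\n{solution}")
    return solution
```

Passing `call_llm` as a parameter keeps the sketch independent of any particular API client; in the paper's setup it would wrap a GPT-4 call with temperature 0 for reproducibility.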
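The Open Datasets quote describes the evaluation metric as string matching between the generated output and the TriviaQA answer variants for each question. A minimal sketch of that kind of metric, assuming lowercase, punctuation-stripped normalization (the paper does not specify normalization details) and hypothetical function names:

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_correct(output: str, answer_aliases: list[str]) -> bool:
    """True if any veridical target-answer variant appears in the generated output."""
    normalized_output = normalize(output)
    return any(normalize(alias) in normalized_output for alias in answer_aliases)


def accuracy(outputs: list[str], aliases_per_question: list[list[str]]) -> float:
    """Fraction of questions whose output contains at least one answer variant."""
    hits = sum(
        is_correct(out, aliases)
        for out, aliases in zip(outputs, aliases_per_question)
    )
    return hits / len(outputs)
```

Since each TriviaQA question carries a list of acceptable answer variants, matching any single alias suffices; this is what makes the metric automatic rather than requiring human judgment.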