AutoAgents: A Framework for Automatic Agent Generation
Authors: Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje Karlsson, Jie Fu, Yemin Shi
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents. |
| Researcher Affiliation | Collaboration | Guangyao Chen¹, Siwei Dong¹, Yu Shu¹, Ge Zhang⁴, Jaward Sesay³, Börje Karlsson³, Jie Fu², and Yemin Shi¹. ¹Peking University, ²Hong Kong University of Science and Technology, ³Beijing Academy of Artificial Intelligence, ⁴University of Waterloo |
| Pseudocode | Yes | Algorithm 1: AutoAgents Execution Process. Input: User task/Question. Output: Task solution/Answer. (A hedged sketch of this control flow appears after the table.) |
| Open Source Code | Yes | The repository of this project is available at https://github.com/Link-AGI/AutoAgents. |
| Open Datasets | Yes | We constructed a benchmark consisting of 100 instances for each N, encompassing a total of 1000 trivia questions. Evaluation Metrics. Drawing on the approach of [Wang et al., 2023c], we adopt an automatic metric to identify factual errors and measure a model's capacity to integrate diverse domain knowledge. We conduct string matching with the veridical target answers for each question on the generated output. The target answers are supplied from the TriviaQA dataset [Joshi et al., 2017], and each question can have a list of answer variants. (A sketch of this string-matching metric appears after the table.) |
| Dataset Splits | No | The paper does not explicitly state specific training/validation/test splits for all experiments. While it mentions 'The last 20 samples from a dataset of 100 samples are used as test instances' for the ablation studies, it does not provide comprehensive split details (including validation sets) for all experiments, nor does it reference standard splits with specific sizes. |
| Hardware Specification | No | The paper states, 'All experiments are conducted using the GPT-4 API'. This refers to a software API for an LLM and does not provide any specific details about the underlying hardware (e.g., GPU models, CPU types, memory) used for computations. |
| Software Dependencies | Yes | All experiments are conducted using the GPT-4 API, with the temperature set to 0 to ensure reproducibility. This model is chosen for its superior performance, providing accurate and consistent results. Its accessibility via APIs greatly facilitates our interaction with the model, streamlining our research process. During the drafting phase, a maximum of three discussions are allowed, while in the execution phase, a single agent can perform up to five self-refinements and multiple agents can collaboratively refine up to five times. |
| Experiment Setup | Yes | All experiments are conducted using the GPT-4 API, with the temperature set to 0 to ensure reproducibility. This model is chosen for its superior performance, providing accurate and consistent results. Its accessibility via APIs greatly facilitates our interaction with the model, streamlining our research process. During the drafting phase, a maximum of three discussions are allowed, while in the execution phase, a single agent can perform up to five self-refinements and multiple agents can collaboratively refine up to five times. |
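The Pseudocode and Experiment Setup quotes together give enough structure for a rough sketch of Algorithm 1's control flow: a drafting phase capped at three discussion rounds, followed by an execution phase with up to five single-agent self-refinements and up to five collaborative refinements. The sketch below is an illustration under those quoted caps, not the authors' implementation; the prompts and the `call_llm` callable (e.g., a GPT-4 API call with temperature 0) are stand-ins.

```python
from typing import Callable

# Refinement caps quoted in the paper's experiment setup.
MAX_DRAFT_DISCUSSIONS = 3   # drafting phase: at most three discussions
MAX_SELF_REFINEMENTS = 5    # execution phase: per-agent self-refinements
MAX_COLLAB_REFINEMENTS = 5  # execution phase: collaborative refinements


def autoagents_run(task: str, call_llm: Callable[[str], str]) -> str:
    """Hypothetical sketch: drafting (team/plan generation plus discussion),
    then execution with self-refinement and collaborative refinement."""
    # Drafting phase: propose an agent team and plan, then discuss and revise.
    plan = call_llm(f"Draft an agent team and execution plan for: {task}")
    for _ in range(MAX_DRAFT_DISCUSSIONS):
        critique = call_llm(f"Discuss and critique this plan:\n{plan}")
        plan = call_llm(f"Revise the plan using this critique:\n{critique}\n\nPlan:\n{plan}")

    # Execution phase: draft a solution, then refine within the quoted caps.
    solution = call_llm(f"Execute this plan and draft a solution:\n{plan}")
    for _ in range(MAX_SELF_REFINEMENTS):
        solution = call_llm(f"Self-refine this draft solution:\n{solution}")
    for _ in range(MAX_COLLAB_REFINEMENTS):
        solution = call_llm(f"As the full agent team, collaboratively refine:\n{solution}")
    return solution
```

Passing `call_llm` as a parameter keeps the sketch independent of any particular API client; in the paper's setup it would wrap a GPT-4 call with temperature 0 for reproducibility.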
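The Open Datasets quote describes the evaluation metric as string matching between the generated output and the TriviaQA answer variants for each question. A minimal sketch of that kind of metric, assuming lowercase, punctuation-stripped normalization (the paper does not specify normalization details) and hypothetical function names:

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_correct(output: str, answer_aliases: list[str]) -> bool:
    """True if any veridical target-answer variant appears in the generated output."""
    normalized_output = normalize(output)
    return any(normalize(alias) in normalized_output for alias in answer_aliases)


def accuracy(outputs: list[str], aliases_per_question: list[list[str]]) -> float:
    """Fraction of questions whose output contains at least one answer variant."""
    hits = sum(
        is_correct(out, aliases)
        for out, aliases in zip(outputs, aliases_per_question)
    )
    return hits / len(outputs)
```

Since each TriviaQA question carries a list of acceptable answer variants, matching any single alias suffices; this is what makes the metric automatic rather than requiring human judgment.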