Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Authors: Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that AgentNet achieves higher task accuracy than both single-agent and centralized multi-agent baselines.
Researcher Affiliation	Academia	Yingxuan Yang1, Huacan Chai1, Shuai Shao1, Yuanyi Song1, Siyuan Qi1, Renting Rui1, Weinan Zhang1,2; 1Shanghai Jiao Tong University, 2Shanghai Innovation Institute EMAIL
Pseudocode	Yes	A.1 Pseudocode of AgentNet; Algorithm 1: AgentNet System
Open Source Code	No	Justification: The code will be made public upon acceptance.
Open Datasets	Yes	Mathematics: This task involves mathematical problems and is evaluated using MATH [8], which includes problems of 7 different types. The training set consists of 100 examples per type (total of 700 problems), while the test set consists of 20 examples per type (total of 140 problems). Logical Question Answering: This task tests reasoning and logical question answering abilities using the BBH (Big-Bench Hard) benchmark [21]. Function-Calling: This benchmark evaluates the agent's ability to perform tool-augmented task planning and API usage, based on the API-Bank dataset [14].
Dataset Splits	Yes	Mathematics: The training set consists of 100 examples per type (total of 700 problems), while the test set consists of 20 examples per type (total of 140 problems). Logical Question Answering: The training set follows the MorphAgent setup, selecting 627 examples from 20 tasks. For testing, each task has 5 examples of varying difficulty, totaling 100 test problems. Function-Calling: We construct a training set of 100 tasks and a test set of 100 tasks, randomly sampled from the full API-Bank corpus.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. The paper states that 'The experiments are sufficiently discussed to be run by others' in the NeurIPS checklist, which is vague.
Software Dependencies	Yes	In our implementation, we configure the LLM API with a temperature of 0.0, a maximum token limit of 2048, and a top-p value of 1.0, ensuring consistent results throughout our experiments and enabling reliable comparisons and analysis. For the memory pool experiment, we utilize the 'BAAI/bge-large-en-v1.5' model to compute the similarity between task queries and database trajectories.
Experiment Setup	Yes	Parameter Configuration: In our implementation, we configure the LLM API with a temperature of 0.0, a maximum token limit of 2048, and a top-p value of 1.0, ensuring consistent results throughout our experiments and enabling reliable comparisons and analysis (Section C). In addition, Section D.2 (Initial Configuration) lists the following under `experiment_config` and `default_agent_config`: `task_num: 100`, `agent_num: 3`, `forward_path_max_length: 3`, `max_execution_times: 5`, `user_react: True`, `abilities` weights, `executor_memory_limit: 40`, `embedding_cache_limit: 1000`, `router_memory_limit: -1`, `decay_rate: 0.1`, `decay_interval: 10`, `router_retrieval_num: 3`, `executor_retrieval_num: 3`.
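
For readers who want to re-enter the reported settings, the values quoted in the Dataset Splits and Experiment Setup rows can be collected into one small Python sketch. This is not the authors' configuration file. The table does not say which keys belong to `experiment_config` and which to `default_agent_config`, so the flat layout is an assumption, as are the names `LLM_SAMPLING`, `AGENTNET_SETTINGS`, `SPLITS`, `split_totals`, and the key `max_tokens`. The `abilities` weights are omitted because their values are not given.

```python
# Illustrative sketch only: values are copied from the table rows above,
# but the dictionary names and the flat layout are assumptions.

# LLM API sampling settings (Section C of the paper, as quoted above).
# The key name "max_tokens" is a hypothetical label for the 2048-token limit.
LLM_SAMPLING = {"temperature": 0.0, "max_tokens": 2048, "top_p": 1.0}

# Keys and values as listed for Section D.2; the `abilities` weights are
# left out because the table does not report their values.
AGENTNET_SETTINGS = {
    "task_num": 100,
    "agent_num": 3,
    "forward_path_max_length": 3,
    "max_execution_times": 5,
    "user_react": True,
    "executor_memory_limit": 40,
    "embedding_cache_limit": 1000,
    "router_memory_limit": -1,
    "decay_rate": 0.1,
    "decay_interval": 10,
    "router_retrieval_num": 3,
    "executor_retrieval_num": 3,
}

# Dataset split descriptions from the Dataset Splits row.
SPLITS = {
    "MATH": {"groups": 7, "train_per_group": 100, "test_per_group": 20},
    "BBH": {"groups": 20, "train_total": 627, "test_per_group": 5},
    "API-Bank": {"train_total": 100, "test_total": 100},
}


def split_totals(splits: dict) -> dict:
    """Return (train, test) totals per benchmark, deriving them from
    per-group counts when an explicit total is not stated."""
    totals = {}
    for name, spec in splits.items():
        groups = spec.get("groups", 1)
        train = spec.get("train_total", groups * spec.get("train_per_group", 0))
        test = spec.get("test_total", groups * spec.get("test_per_group", 0))
        totals[name] = (train, test)
    return totals


if __name__ == "__main__":
    for name, (train, test) in split_totals(SPLITS).items():
        print(f"{name}: train={train}, test={test}")
```

The derived totals match the figures quoted in the table: 7 types at 100 and 20 examples each give 700 and 140 MATH problems, and 20 BBH tasks at 5 test examples each give 100 test problems.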