Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines

Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with the human-designed multi-agent system, which is optimized for those specific tasks. The code can be found at: https://github.com/SaFoLab-WISC/MetaAgent/."
Researcher Affiliation | Academia | "University of Wisconsin-Madison, Madison, US. Correspondence to: Chaowei Xiao <EMAIL>."
Pseudocode | Yes | "Algorithm 1: FSM State Optimization; Algorithm 2: Deployment Stage"
Open Source Code | Yes | "The code can be found at: https://github.com/SaFoLab-WISC/MetaAgent/."
Open Datasets | Yes | "Firstly, we compare MetaAgent with other prompt-based methods on Trivia Creative Writing (Wang et al., 2024d) and GPQA (Rein et al., 2023). Machine Learning Bench (ML-Bench) (Hong et al., 2024a) is a benchmark that requires agents to train a machine-learning model for regression or classification."
Dataset Splits | Yes |
    # Load the dataset
    train_data_path = "/Users/a11/Desktop/MetaAgent/MetaAgent/ml_benchmark/04_titanic/split_train.csv"
    eval_data_path = "/Users/a11/Desktop/MetaAgent/MetaAgent/ml_benchmark/04_titanic/split_eval.csv"
Hardware Specification | No | "The paper does not explicitly describe the hardware used to run its experiments, such as GPU or CPU models; it only mentions the foundation model used (GPT-4o)."
Software Dependencies | No | "The paper lists several software libraries used in its code examples (pandas, sklearn, etc.) but does not provide version numbers for these dependencies, which are required for reproducibility."
Experiment Setup | Yes | "We selected GPT-4o as the foundation model in the main experiments and set the temperature to 0 to ensure reproducibility."
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(random_state=42)),
    ])
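The Pipeline fragment quoted in the Experiment Setup row is incomplete on its own (the `preprocessor` it references is never defined). The sketch below is a minimal self-contained reconstruction, assuming a Titanic-style tabular dataset with numeric and categorical columns; the column names and preprocessing choices here are illustrative, not taken from the paper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column names; the paper's Titanic split uses its own schema.
numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "embarked"]

# Impute and scale numeric features; impute and one-hot encode categoricals.
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

# Mirrors the quoted setup: preprocessing followed by a random forest.
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
```

Fixing `random_state` (and, for the LLM, temperature 0) pins the stochastic parts of the setup, which is what makes the reported runs repeatable.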
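For readers unfamiliar with the finite-state-machine framing named in the paper's title and pseudocode row, the following is a generic sketch of FSM-style agent orchestration. It is not the paper's Algorithm 1 or 2 (those are in the venue PDF); the state names, transition table, and handler interface are all hypothetical.

```python
# Hypothetical transition table: (state, outcome) -> next state.
transitions = {
    ("draft", "ok"): "review",
    ("draft", "fail"): "draft",
    ("review", "ok"): "done",
    ("review", "fail"): "draft",
}

def run_fsm(handlers, start="draft", end="done", max_steps=10):
    """Drive the FSM: in each state, one 'agent' handler acts and its
    outcome selects the next state, until the end state or step budget."""
    state = start
    for _ in range(max_steps):
        if state == end:
            return state
        outcome = handlers[state]()  # the agent assigned to this state acts
        state = transitions[(state, outcome)]
    return state
```

The design point is that the FSM, not the agents, owns control flow: each agent only reports an outcome, and the transition table decides what runs next.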