Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines

Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with the human-designed multi-agent system, which is optimized for those specific tasks. The code can be found at: https://github.com/SaFoLab-WISC/MetaAgent/."
Researcher Affiliation | Academia | "University of Wisconsin-Madison, Madison, US. Correspondence to: Chaowei Xiao <EMAIL>."
Pseudocode | Yes | "Algorithm 1: FSM State Optimization; Algorithm 2: Deployment Stage"
Open Source Code | Yes | "The code can be found at: https://github.com/SaFoLab-WISC/MetaAgent/."
Open Datasets | Yes | "Firstly, we compare MetaAgent with other prompt-based methods on Trivia Creative Writing (Wang et al., 2024d) and GPQA (Rein et al., 2023). Machine Learning Bench (ML-Bench) (Hong et al., 2024a) is a benchmark that requires agents to train a machine-learning model for regression or classification."
Dataset Splits | Yes |
    # Load the dataset
    train_data_path = "/Users/a11/Desktop/MetaAgent/MetaAgent/ml_benchmark/04_titanic/split_train.csv"
    eval_data_path = "/Users/a11/Desktop/MetaAgent/MetaAgent/ml_benchmark/04_titanic/split_eval.csv"
Hardware Specification | No | "The paper does not explicitly describe the hardware used to run its experiments, such as GPU or CPU models; it only mentions the foundation model used (GPT-4o)."
Software Dependencies | No | "The paper lists several software libraries used in its code examples (pandas, sklearn, etc.) but does not provide version numbers for these dependencies, which are required for reproducibility."
Experiment Setup | Yes | "We selected GPT-4o as the foundation model in the main experiments and set the temperature to 0 to ensure reproducibility."
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(random_state=42)),
    ])
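The Pipeline fragment quoted in the Experiment Setup row is incomplete on its own (the `preprocessor` it references is never defined). The sketch below is a minimal self-contained reconstruction, assuming a Titanic-style tabular dataset with numeric and categorical columns; the column names and preprocessing choices here are illustrative, not taken from the paper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column names; the paper's Titanic split uses its own schema.
numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "embarked"]

# Impute and scale numeric features; impute and one-hot encode categoricals.
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

# Mirrors the quoted setup: preprocessing followed by a random forest.
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
```

Fixing `random_state` (and, for the LLM, temperature 0) pins the stochastic parts of the setup, which is what makes the reported runs repeatable.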
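For readers unfamiliar with the finite-state-machine framing named in the paper's title and pseudocode row, the following is a generic sketch of FSM-style agent orchestration. It is not the paper's Algorithm 1 or 2 (those are in the venue PDF); the state names, transition table, and handler interface are all hypothetical.

```python
# Hypothetical transition table: (state, outcome) -> next state.
transitions = {
    ("draft", "ok"): "review",
    ("draft", "fail"): "draft",
    ("review", "ok"): "done",
    ("review", "fail"): "draft",
}

def run_fsm(handlers, start="draft", end="done", max_steps=10):
    """Drive the FSM: in each state, one 'agent' handler acts and its
    outcome selects the next state, until the end state or step budget."""
    state = start
    for _ in range(max_steps):
        if state == end:
            return state
        outcome = handlers[state]()  # the agent assigned to this state acts
        state = transitions[(state, outcome)]
    return state
```

The design point is that the FSM, not the agents, owns control flow: each agent only reports an outcome, and the transition table decides what runs next.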