GPTSwarm: Language Agents as Optimizable Graphs
Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. |
| Researcher Affiliation | Academia | AI Initiative, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland. |
| Pseudocode | Yes | Algorithm 1 (Graph Execution). Require: computational graph G = (N, E, F, o), input x, empty context z_n for each node without predecessors. For n in TopologicalSort(N): z_n ← {f_v(z_v, x) : v ∈ pre(n)}. Ensure: f_o(z_o, x). (A runnable sketch of this loop follows the table.) |
| Open Source Code | Yes | The code can be found here: https://gptswarm.org |
| Open Datasets | Yes | We conducted this experiment using the 4-choice MMLU general knowledge question answering dataset, as detailed by Hendrycks et al. (2021b;a). ... We conduct our evaluation on the Mini Crosswords dataset1. ... We also test the HumanEval dataset (Chen et al., 2021)... Using this benchmark, we evaluate the general applicability of our framework. We construct swarms with multiple agents of the same type and employ self-consistency (a prompt-based majority vote) for the final decision (Wang et al., 2022) (see the majority-vote sketch after the table). |
| Dataset Splits | Yes | The scores are derived from evaluating the initial 10% of the MMLU validation set. ... A subset of 20 problems is used to optimize and evaluate our methods... We optimize our composite graph of agents using the REINFORCE (Alg. 2)... After each iteration, the optimized solution is evaluated on the entire dataset. ...Table 2. Ablations on the GAIA benchmark (Level 1 validation set) (Mialon et al., 2023). |
| Hardware Specification | No | The paper specifies the LLM models used (e.g., "GPT4-Turbo", "GPT-3.5-Turbo", "gpt-4-1106-preview", "gpt-3.5-turbo-1106"), but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions that "The GPTSwarm framework is developed using Python and PyTorch" but does not specify version numbers for these or other key software libraries needed for reproducibility. |
| Experiment Setup | Yes | The edge optimization process applies REINFORCE (Alg. 2) for 200 iterations. Each iteration assesses four graph samples, each on a specific problem sourced from the MMLU dev set. In all experiments, we used GPT4-Turbo with a token sampling temperature of 0.2. ... we optimize and evaluate graphs with the GPT-3.5-Turbo language model, where the temperature is set to zero. ... We use the Adam optimizer with a learning rate of 0.1 to update the logit parameters associated with each potential edge. (See the REINFORCE sketch after the table.) |
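
To make the Pseudocode row concrete, here is a minimal Python sketch of Algorithm 1's graph-execution loop. It uses the standard-library `graphlib` for the topological sort; the function and argument names (`execute_graph`, `f`, `o`) are illustrative and not GPTSwarm's actual API.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def execute_graph(nodes, edges, f, o, x):
    """Sketch of Alg. 1 (Graph Execution) over G = (N, E, F, o).

    nodes: iterable of node ids (N)
    edges: set of (u, v) pairs meaning u -> v (E)
    f:     dict node -> operation f_n(context, x) (F)
    o:     output node
    x:     task input
    """
    # pre(n): predecessors of each node
    pre = {n: [u for (u, v) in edges if v == n] for n in nodes}
    # TopologicalSorter takes a node -> predecessors mapping
    order = TopologicalSorter(pre).static_order()
    z = {n: [] for n in nodes}  # empty context for nodes without predecessors
    for n in order:
        # z_n <- {f_v(z_v, x) : v in pre(n)}
        z[n] = [f[v](z[v], x) for v in pre[n]]
    # Ensure: f_o(z_o, x)
    return f[o](z[o], x)
```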
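The Open Datasets row mentions self-consistency, a prompt-based majority vote over multiple agents of the same type (Wang et al., 2022). A minimal sketch, assuming a hypothetical `ask_agent(question)` callable that returns one of the answer choices:

```python
from collections import Counter

def self_consistency(ask_agent, question, n_agents=5):
    """Majority vote over multiple agent answers (Wang et al., 2022).

    ask_agent: hypothetical callable returning one answer choice
               (e.g., 'A'..'D' for 4-choice MMLU).
    """
    answers = [ask_agent(question) for _ in range(n_agents)]
    # Most common answer wins; Counter breaks ties by first occurrence
    return Counter(answers).most_common(1)[0][0]
```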
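The Experiment Setup row describes REINFORCE over per-edge logit parameters, updated with Adam at learning rate 0.1, for 200 iterations with four graph samples per iteration. The PyTorch sketch below illustrates that recipe under stated assumptions: edges are sampled as independent Bernoulli variables from the logits, `utility(mask)` is a hypothetical callback that executes the sampled graph on one benchmark problem and returns a scalar score, and the mean-reward baseline is a common variance-reduction choice, not necessarily the paper's exact estimator.

```python
import torch

def optimize_edges(num_edges, utility, iterations=200, samples_per_iter=4):
    """REINFORCE over Bernoulli edge variables (cf. Alg. 2 in the paper).

    utility: hypothetical callable taking a 0/1 edge mask and returning a
             scalar score (e.g., accuracy on one MMLU dev problem).
    """
    logits = torch.zeros(num_edges, requires_grad=True)  # one logit per edge
    opt = torch.optim.Adam([logits], lr=0.1)
    for _ in range(iterations):
        opt.zero_grad()
        dist = torch.distributions.Bernoulli(logits=logits)
        masks = dist.sample((samples_per_iter,))           # sampled graphs
        rewards = torch.tensor([utility(m) for m in masks])
        baseline = rewards.mean()                          # variance reduction
        # REINFORCE: maximize E[R] by minimizing -log p(mask) * (R - baseline)
        loss = -(dist.log_prob(masks).sum(-1) * (rewards - baseline)).mean()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits)  # edge-inclusion probabilities
```

After optimization, a final graph can be read off by thresholding or sampling the returned probabilities, matching the report's note that the optimized solution is then evaluated on the entire dataset.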