GPTSwarm: Language Agents as Optimizable Graphs

Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found here."
Researcher Affiliation | Academia | "AI Initiative, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; The Swiss AI Lab IDSIA, USI, SUPSI, Lugano, Switzerland."
Pseudocode | Yes | Algorithm 1 (Graph Execution). Require: computational graph G = (N, E, F, o), input x, empty context z for each node without predecessors. For n in TopologicalSort(N): z_n ← {f_v(z_v, x) : v ∈ pre(n)}. Ensure: f_o(z_o, x). (See the graph-execution sketch after the table.)
Open Source Code | Yes | "The code can be found here." (https://gptswarm.org)
Open Datasets | Yes | "We conducted this experiment using the 4-choice MMLU general knowledge question answering dataset, as detailed by Hendrycks et al. (2021b;a)." ... "We conduct our evaluation on the Mini Crosswords dataset." ... "We also test the HumanEval dataset (Chen et al., 2021)." ... "Using this benchmark, we evaluate the general applicability of our framework. We construct swarms with multiple agents of the same type and employ self-consistency (a prompt-based majority vote) for the final decision (Wang et al., 2022)." (See the majority-vote sketch after the table.)
Dataset Splits | Yes | "The scores are derived from evaluating the initial 10% of the MMLU validation set." ... "A subset of 20 problems is used to optimize and evaluate our methods." ... "We optimize our composite graph of agents using the REINFORCE (Alg. 2)." ... "After each iteration, the optimized solution is evaluated on the entire dataset." ... "Table 2. Ablations on the GAIA benchmark (Level 1 validation set) (Mialon et al., 2023)."
Hardware Specification | No | The paper specifies the LLM models used (e.g., "GPT4-Turbo", "GPT-3.5-Turbo", "gpt-4-1106-preview", "gpt-3.5-turbo-1106"), but does not specify the underlying hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions that "the GPTSwarm framework is developed using Python and PyTorch" but does not specify version numbers for these or other key software libraries needed for reproducibility.
Experiment Setup | Yes | "The edge optimization process applies REINFORCE (Alg. 2) for 200 iterations. Each iteration assesses four graph samples, each on a specific problem sourced from the MMLU dev set. In all experiments, we used GPT4-Turbo with the token sampling temperature 0.2." ... "We optimize and evaluate graphs with the GPT-3.5-Turbo language model, where the temperature is set to zero." ... "We use the Adam optimizer with a learning rate of 0.1 to update the logit parameters associated with each potential edge." (See the REINFORCE sketch after the table.)
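
Graph-execution sketch. The topological loop quoted in the Pseudocode row translates directly into Python. The following is a minimal sketch assuming a dictionary-based graph representation; the function names and signatures are illustrative and are not GPTSwarm's actual API.

    from graphlib import TopologicalSorter
    from typing import Any, Callable

    def execute_graph(ops: dict[str, Callable[[Any, Any], Any]],
                      pre: dict[str, list[str]],
                      o: str,
                      x: Any) -> Any:
        """Algorithm 1 sketch: z_n <- {f_v(z_v, x) : v in pre(n)}; output f_o(z_o, x).

        ops: node id -> node operation f_n(z_n, x)
        pre: node id -> predecessor ids (empty for source nodes)
        o:   id of the designated output node
        x:   graph-level input
        """
        # graphlib orders nodes so every predecessor precedes its successors.
        order = TopologicalSorter({n: set(pre.get(n, [])) for n in ops}).static_order()
        z: dict[str, Any] = {n: {} for n in ops}  # empty context for source nodes
        for n in order:
            if pre.get(n):
                # Aggregate each predecessor's transformed context into z_n.
                z[n] = {v: ops[v](z[v], x) for v in pre[n]}
        return ops[o](z[o], x)

A dict rather than a set holds the aggregated contexts so that node outputs need not be hashable; otherwise the loop mirrors the quoted algorithm line for line.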
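
Majority-vote sketch. The self-consistency decision mentioned in the Open Datasets row is a plain majority vote over independent agent answers. A minimal sketch, with illustrative answer strings:

    from collections import Counter

    def self_consistency(answers: list[str]) -> str:
        """Return the most frequent answer; ties go to the first answer seen."""
        return Counter(answers).most_common(1)[0][0]

    # Five agents answering a 4-choice MMLU question:
    print(self_consistency(["B", "B", "C", "B", "A"]))  # -> "B"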
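
REINFORCE sketch. The edge-optimization setup in the last row (REINFORCE, 200 iterations, four graph samples per iteration, Adam with learning rate 0.1 on per-edge logits) can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes each candidate edge is sampled from a Bernoulli distribution whose probability is the sigmoid of its logit, and evaluate_graph is a hypothetical stand-in for running a sampled swarm on one benchmark problem and scoring the result.

    import torch

    def evaluate_graph(edge_mask: torch.Tensor) -> float:
        """Hypothetical utility signal; replace with a real benchmark score."""
        return edge_mask.float().mean().item()

    num_edges = 16                                       # illustrative graph size
    logits = torch.zeros(num_edges, requires_grad=True)  # one logit per potential edge
    opt = torch.optim.Adam([logits], lr=0.1)             # lr from the quoted setup

    for step in range(200):                              # 200 iterations, as quoted
        dist = torch.distributions.Bernoulli(probs=torch.sigmoid(logits))
        loss = torch.zeros(())
        for _ in range(4):                               # four graph samples per iteration
            mask = dist.sample()                         # realize one concrete edge set
            utility = evaluate_graph(mask)               # score it on one problem
            # REINFORCE: raise the log-probability of high-utility samples.
            loss = loss - utility * dist.log_prob(mask).sum()
        opt.zero_grad()
        (loss / 4).backward()
        opt.step()

A variance-reducing baseline (e.g., subtracting the mean utility of the batch) is commonly added to REINFORCE; it is omitted here for brevity.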