GPTSwarm: Language Agents as Optimizable Graphs
Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. |
| Researcher Affiliation | Academia | AI Initiative, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland. |
| Pseudocode | Yes | Algorithm 1 (Graph Execution). Require: computational graph G = (N, E, F, o), input x, empty context z_n for each node without predecessors. For n in TopologicalSort(N): z_n ← {f_v(z_v, x) : v ∈ pre(n)}. Ensure: f_o(z_o, x). (A runnable sketch of this loop follows the table.) |
| Open Source Code | Yes | The code can be found here: https://gptswarm.org |
| Open Datasets | Yes | We conducted this experiment using the 4-choice MMLU general knowledge question answering dataset, as detailed by Hendrycks et al. (2021b;a). ... We conduct our evaluation on the Mini Crosswords dataset1. ... We also test the HumanEval dataset (Chen et al., 2021)... Using this benchmark, we evaluate the general applicability of our framework. We construct swarms with multiple agents of the same type and employ self-consistency (a prompt-based majority vote) for the final decision (Wang et al., 2022) (see the majority-vote sketch after the table). |
| Dataset Splits | Yes | The scores are derived from evaluating the initial 10% of the MMLU validation set. ... A subset of 20 problems is used to optimize and evaluate our methods... We optimize our composite graph of agents using the REINFORCE (Alg. 2)... After each iteration, the optimized solution is evaluated on the entire dataset. ...Table 2. Ablations on the GAIA benchmark (Level 1 validation set) (Mialon et al., 2023). |
| Hardware Specification | No | The paper specifies the LLM models used (e.g., "GPT4-Turbo", "GPT-3.5-Turbo", "gpt-4-1106-preview", "gpt-3.5-turbo-1106"), but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions that "The GPTSwarm framework is developed using Python and PyTorch" but does not specify version numbers for these or other key software libraries needed for reproducibility. |
| Experiment Setup | Yes | The edge optimization process applies REINFORCE (Alg. 2) for 200 iterations. Each iteration assesses four graph samples, each on a specific problem sourced from the MMLU dev set. In all experiments, we used GPT4-Turbo with a token sampling temperature of 0.2. ... we optimize and evaluate graphs with the GPT-3.5-Turbo language model, where the temperature is set to zero. ... We use the Adam optimizer with a learning rate of 0.1 to update the logit parameters associated with each potential edge. (See the REINFORCE sketch after the table.) |
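
To make the Pseudocode row concrete, here is a minimal Python sketch of Algorithm 1's graph-execution loop. It uses the standard-library `graphlib` for the topological sort; the function and argument names (`execute_graph`, `f`, `o`) are illustrative and not GPTSwarm's actual API.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def execute_graph(nodes, edges, f, o, x):
    """Sketch of Alg. 1 (Graph Execution) over G = (N, E, F, o).

    nodes: iterable of node ids (N)
    edges: set of (u, v) pairs meaning u -> v (E)
    f:     dict node -> operation f_n(context, x) (F)
    o:     output node
    x:     task input
    """
    # pre(n): predecessors of each node
    pre = {n: [u for (u, v) in edges if v == n] for n in nodes}
    # TopologicalSorter takes a node -> predecessors mapping
    order = TopologicalSorter(pre).static_order()
    z = {n: [] for n in nodes}  # empty context for nodes without predecessors
    for n in order:
        # z_n <- {f_v(z_v, x) : v in pre(n)}
        z[n] = [f[v](z[v], x) for v in pre[n]]
    # Ensure: f_o(z_o, x)
    return f[o](z[o], x)
```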
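The Open Datasets row mentions self-consistency, a prompt-based majority vote over multiple agents of the same type (Wang et al., 2022). A minimal sketch, assuming a hypothetical `ask_agent(question)` callable that returns one of the answer choices:

```python
from collections import Counter

def self_consistency(ask_agent, question, n_agents=5):
    """Majority vote over multiple agent answers (Wang et al., 2022).

    ask_agent: hypothetical callable returning one answer choice
               (e.g., 'A'..'D' for 4-choice MMLU).
    """
    answers = [ask_agent(question) for _ in range(n_agents)]
    # Most common answer wins; Counter breaks ties by first occurrence
    return Counter(answers).most_common(1)[0][0]
```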
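The Experiment Setup row describes REINFORCE over per-edge logit parameters, updated with Adam at learning rate 0.1, for 200 iterations with four graph samples per iteration. The PyTorch sketch below illustrates that recipe under stated assumptions: edges are sampled as independent Bernoulli variables from the logits, `utility(mask)` is a hypothetical callback that executes the sampled graph on one benchmark problem and returns a scalar score, and the mean-reward baseline is a common variance-reduction choice, not necessarily the paper's exact estimator.

```python
import torch

def optimize_edges(num_edges, utility, iterations=200, samples_per_iter=4):
    """REINFORCE over Bernoulli edge variables (cf. Alg. 2 in the paper).

    utility: hypothetical callable taking a 0/1 edge mask and returning a
             scalar score (e.g., accuracy on one MMLU dev problem).
    """
    logits = torch.zeros(num_edges, requires_grad=True)  # one logit per edge
    opt = torch.optim.Adam([logits], lr=0.1)
    for _ in range(iterations):
        opt.zero_grad()
        dist = torch.distributions.Bernoulli(logits=logits)
        masks = dist.sample((samples_per_iter,))           # sampled graphs
        rewards = torch.tensor([utility(m) for m in masks])
        baseline = rewards.mean()                          # variance reduction
        # REINFORCE: maximize E[R] by minimizing -log p(mask) * (R - baseline)
        loss = -(dist.log_prob(masks).sum(-1) * (rewards - baseline)).mean()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits)  # edge-inclusion probabilities
```

After optimization, a final graph can be read off by thresholding or sampling the returned probabilities, matching the report's note that the optimized solution is then evaluated on the entire dataset.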