Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GoT and show its advantages over the state of the art (contribution #4). Overall, we observe that GoT is particularly well-suited for tasks that can be naturally decomposed into smaller subtasks that are solved individually and then merged for a final solution. Here, GoT outperforms other schemes, for example improving upon CoT and ToT by, respectively, 70% and 62%, in terms of the quality of sorting, while simultaneously reducing costs by >31% over ToT. |
| Researcher Affiliation | Collaboration | ¹ETH Zurich, ²Warsaw University of Technology, ³Cledar |
| Pseudocode | Yes | The GoT architecture consists of a set of interacting modules, see Figure 2 (the blue part). These modules are the Prompter (prepares the messages for the LLM), the Parser (extracts information from LLM thoughts), the Scoring module (verifies and scores the LLM thoughts), and the Controller (coordinates the entire reasoning process and decides how to progress it). Figure 2 also provides 'API for Prompter (extensible)', 'API for Controller', 'API for Parser (extensible)', and 'Available operations when building the GoO (extensible)', which list function names and parameters, providing a pseudocode-like description of the system's operations. |
| Open Source Code | Yes | Website & Code: https://github.com/spcl/graph-of-thoughts |
| Open Datasets | No | The paper mentions using '100 input samples for each task' and describes the types of tasks (sorting, set operations, keyword counting, document merging). However, it does not specify whether these are existing public datasets, nor does it provide links, DOIs, or formal citations for accessing the data used in the experiments. It appears to imply custom-generated data. |
| Dataset Splits | No | The paper states 'We use 100 input samples for each task and comparison baseline' but does not specify any training, validation, or test dataset splits, percentages, or absolute sample counts for each split. It does not mention cross-validation or predefined splits. |
| Hardware Specification | No | The acknowledgements mention 'access to the Ault and Daint machines' which are supercomputers. However, these are general computing resources and the paper does not specify particular CPU models, GPU models, memory, or specific configurations of these machines that were used for the experiments. |
| Software Dependencies | No | The paper states 'Due to budget restrictions, we focus on GPT-3.5. We also experimented with Llama 2'. These are language models, not specific software dependencies or libraries with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | We use 100 input samples for each task and comparison baseline. We set the temperature to 1.0 and use a 4k context size unless stated otherwise. Parameters: We experiment extensively with the branching factor k and the number of levels L to ensure that we compare GoT to cost-effective and advantageous configurations. |
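The module interaction described in the Pseudocode row (Prompter prepares messages, the LLM responds, the Parser extracts thoughts, the Scoring module scores them, and the Controller drives the loop) can be sketched as follows. This is a minimal illustrative sketch, not the actual API of the spcl/graph-of-thoughts repository: all class names, method signatures, and the toy scoring heuristic are assumptions made for clarity.

```python
# Hypothetical sketch of the GoT module interaction:
# Prompter -> LLM -> Parser -> Scorer, coordinated by a Controller.
from dataclasses import dataclass


@dataclass
class Thought:
    state: str          # the partial solution carried by this graph vertex
    score: float = 0.0  # assigned by the scoring module


class Prompter:
    def build_prompt(self, thoughts):
        # Prepare the message sent to the LLM from the current thoughts.
        return "Improve and merge:\n" + "\n".join(t.state for t in thoughts)


class Parser:
    def parse(self, llm_output):
        # Extract candidate thoughts from the raw LLM response,
        # one per non-empty line (an illustrative convention).
        return [Thought(state=line) for line in llm_output.splitlines() if line]


class Scorer:
    def score(self, thought):
        # Verify/score a thought; here a toy heuristic (shorter = better).
        thought.score = 1.0 / (1 + len(thought.state))
        return thought.score


class Controller:
    def __init__(self, llm, prompter, parser, scorer):
        self.llm = llm            # any callable: prompt string -> response string
        self.prompter = prompter
        self.parser = parser
        self.scorer = scorer

    def step(self, frontier, keep=2):
        # One reasoning step: prompt, parse, score, keep the best thoughts.
        prompt = self.prompter.build_prompt(frontier)
        candidates = self.parser.parse(self.llm(prompt))
        for t in candidates:
            self.scorer.score(t)
        return sorted(candidates, key=lambda t: t.score, reverse=True)[:keep]
```

A usage example with a stubbed LLM: `Controller(lambda p: "ab\nabcd\n", Prompter(), Parser(), Scorer()).step([Thought("seed")], keep=1)` keeps only the highest-scoring candidate. The separation into four small objects mirrors the paper's claim that the Prompter and Parser APIs are extensible: swapping in a task-specific Prompter or Parser does not touch the Controller loop.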