Graph of Thoughts: Solving Elaborate Problems with Large Language Models

Authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate GoT and show its advantages over the state of the art (contribution #4). Overall, we observe that GoT is particularly well-suited for tasks that can be naturally decomposed into smaller subtasks that are solved individually and then merged for a final solution. Here, GoT outperforms other schemes, for example improving upon CoT and ToT by 70% and 62%, respectively, in terms of the quality of sorting, while simultaneously reducing costs by >31% over ToT.
Researcher Affiliation | Collaboration | 1 ETH Zurich, 2 Warsaw University of Technology, 3 Cledar
Pseudocode | Yes | The GoT architecture consists of a set of interacting modules, see Figure 2 (the blue part). These modules are the Prompter (prepares the messages for the LLM), the Parser (extracts information from LLM thoughts), the Scoring module (verifies and scores the LLM thoughts), and the Controller (coordinates the entire reasoning process, and decides on how to progress it). Figure 2 also provides 'API for Prompter (extensible)', 'API for Controller', 'API for Parser (extensible)', and 'Available operations when building the GoO (extensible)', which list function names and parameters, providing a pseudocode-like description of the system's operations.
Open Source Code | Yes | Website & Code: https://github.com/spcl/graph-of-thoughts
Open Datasets | No | The paper mentions using '100 input samples for each task' and describes the task types (sorting, set operations, keyword counting, document merging), but it does not specify whether these come from existing public datasets, nor does it provide links, DOIs, or formal citations for accessing the data used in the experiments. This implies custom-generated data.
Dataset Splits | No | The paper states 'We use 100 input samples for each task and comparison baseline' but does not specify any training, validation, or test splits (as percentages or absolute sample counts), nor does it mention cross-validation or predefined splits.
Hardware Specification | No | The acknowledgements mention 'access to the Ault and Daint machines', which are supercomputers. However, these are general computing resources, and the paper does not specify the particular CPU models, GPU models, memory, or configurations of these machines used for the experiments.
Software Dependencies | No | The paper states 'Due to budget restrictions, we focus on GPT-3.5. We also experimented with Llama 2'. These are language models, not specific software dependencies or libraries with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for reproducibility.
Experiment Setup | Yes | We use 100 input samples for each task and comparison baseline. We set the temperature to 1.0 and use a 4k context size unless stated otherwise. Parameters: We experiment extensively with the branching factor k and the number of levels L to ensure that we compare GoT to cost-effective and advantageous configurations.
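The modular architecture described in the Pseudocode row (Prompter, Parser, Scoring module, Controller) can be sketched as follows. This is a minimal illustrative sketch; all class and method names here are assumptions for exposition, not the actual API of the spcl/graph-of-thoughts repository.

```python
# Illustrative sketch of the GoT module interaction described above.
# All names (Prompter, Parser, Scorer, Controller) are assumptions,
# not the actual spcl/graph-of-thoughts API.

class Prompter:
    def build_prompt(self, thoughts):
        # Prepare the message sent to the LLM from the current thoughts.
        return "Improve these thoughts: " + "; ".join(thoughts)

class Parser:
    def parse(self, response):
        # Extract candidate thoughts from the raw LLM response.
        return [t.strip() for t in response.split(";") if t.strip()]

class Scorer:
    def score(self, thought):
        # Verify and score a thought; here, longer thoughts score higher
        # (a stand-in for a real verification/scoring step).
        return len(thought)

class Controller:
    """Coordinates the reasoning process and decides how to progress it."""
    def __init__(self, llm, prompter, parser, scorer):
        self.llm, self.prompter = llm, prompter
        self.parser, self.scorer = parser, scorer

    def step(self, thoughts, keep=2):
        prompt = self.prompter.build_prompt(thoughts)
        response = self.llm(prompt)
        candidates = self.parser.parse(response)
        # Keep the best-scoring thoughts for the next level of the graph.
        return sorted(candidates, key=self.scorer.score, reverse=True)[:keep]

# Usage with a stubbed LLM in place of a real model call:
fake_llm = lambda prompt: "short; a much longer elaborated thought; mid one"
ctrl = Controller(fake_llm, Prompter(), Parser(), Scorer())
best = ctrl.step(["seed thought"])
```

In the real system, the LLM call, scoring, and graph-of-operations bookkeeping are considerably richer; the point of the sketch is only the division of responsibilities among the four modules.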
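The decompose-and-merge pattern that the assessment highlights (splitting a task into subtasks solved individually, then merged, as in sorting) can be illustrated with a small sketch. The helper below is a plain-Python illustration of that pattern under the assumption of a branching factor k; it is not the paper's prompting pipeline, where each subtask would be solved by an LLM call.

```python
# Minimal illustration of the split/solve/merge pattern GoT exploits
# for sorting: decompose the input into k sublists (branching factor k),
# "solve" each independently, then merge the partial results.
from heapq import merge

def got_style_sort(values, k=4):
    # Decompose: split into k roughly equal sublists (ceil division).
    chunk = max(1, -(-len(values) // k))
    sublists = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # Solve each subtask individually (an LLM call in the real system).
    solved = [sorted(s) for s in sublists]
    # Merge the partial solutions into the final answer.
    return list(merge(*solved))

print(got_style_sort([9, 3, 7, 1, 8, 2, 6, 5]))  # → [1, 2, 3, 5, 6, 7, 8, 9]
```

Because each sublist is solved independently, errors stay local to a branch and the merge step can aggregate only the verified partial results, which is the property the table's quality and cost comparisons rest on.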