Graph of Thoughts: Solving Elaborate Problems with Large Language Models

Authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate GoT and show its advantages over the state of the art (contribution #4). Overall, we observe that GoT is particularly well-suited for tasks that can be naturally decomposed into smaller subtasks that are solved individually and then merged for a final solution. Here, GoT outperforms other schemes, for example improving upon CoT and ToT by 70% and 62%, respectively, in terms of the quality of sorting, while simultaneously reducing costs by >31% over ToT.
Researcher Affiliation | Collaboration | 1 ETH Zurich, 2 Warsaw University of Technology, 3 Cledar
Pseudocode | Yes | The GoT architecture consists of a set of interacting modules, see Figure 2 (the blue part). These modules are the Prompter (prepares the messages for the LLM), the Parser (extracts information from LLM thoughts), the Scoring module (verifies and scores the LLM thoughts), and the Controller (coordinates the entire reasoning process, and decides on how to progress it). Figure 2 also provides 'API for Prompter (extensible)', 'API for Controller', 'API for Parser (extensible)', and 'Available operations when building the GoO (extensible)', which list function names and parameters, providing a pseudocode-like description of the system's operations.
Open Source Code | Yes | Website & Code: https://github.com/spcl/graph-of-thoughts
Open Datasets | No | The paper mentions using '100 input samples for each task' and describes the task types (sorting, set operations, keyword counting, document merging), but it does not specify whether these come from existing public datasets, nor does it provide links, DOIs, or formal citations for accessing the data used in the experiments. This implies custom-generated data.
Dataset Splits | No | The paper states 'We use 100 input samples for each task and comparison baseline' but does not specify any training, validation, or test splits (as percentages or absolute sample counts), nor does it mention cross-validation or predefined splits.
Hardware Specification | No | The acknowledgements mention 'access to the Ault and Daint machines', which are supercomputers. However, these are general computing resources, and the paper does not specify the particular CPU models, GPU models, memory, or configurations of these machines used for the experiments.
Software Dependencies | No | The paper states 'Due to budget restrictions, we focus on GPT-3.5. We also experimented with Llama 2'. These are language models, not specific software dependencies or libraries with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for reproducibility.
Experiment Setup | Yes | We use 100 input samples for each task and comparison baseline. We set the temperature to 1.0 and use a 4k context size unless stated otherwise. Parameters: We experiment extensively with the branching factor k and the number of levels L to ensure that we compare GoT to cost-effective and advantageous configurations.
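The modular architecture described in the Pseudocode row (Prompter, Parser, Scoring module, Controller) can be sketched as follows. This is a minimal illustrative sketch; all class and method names here are assumptions for exposition, not the actual API of the spcl/graph-of-thoughts repository.

```python
# Illustrative sketch of the GoT module interaction described above.
# All names (Prompter, Parser, Scorer, Controller) are assumptions,
# not the actual spcl/graph-of-thoughts API.

class Prompter:
    def build_prompt(self, thoughts):
        # Prepare the message sent to the LLM from the current thoughts.
        return "Improve these thoughts: " + "; ".join(thoughts)

class Parser:
    def parse(self, response):
        # Extract candidate thoughts from the raw LLM response.
        return [t.strip() for t in response.split(";") if t.strip()]

class Scorer:
    def score(self, thought):
        # Verify and score a thought; here, longer thoughts score higher
        # (a stand-in for a real verification/scoring step).
        return len(thought)

class Controller:
    """Coordinates the reasoning process and decides how to progress it."""
    def __init__(self, llm, prompter, parser, scorer):
        self.llm, self.prompter = llm, prompter
        self.parser, self.scorer = parser, scorer

    def step(self, thoughts, keep=2):
        prompt = self.prompter.build_prompt(thoughts)
        response = self.llm(prompt)
        candidates = self.parser.parse(response)
        # Keep the best-scoring thoughts for the next level of the graph.
        return sorted(candidates, key=self.scorer.score, reverse=True)[:keep]

# Usage with a stubbed LLM in place of a real model call:
fake_llm = lambda prompt: "short; a much longer elaborated thought; mid one"
ctrl = Controller(fake_llm, Prompter(), Parser(), Scorer())
best = ctrl.step(["seed thought"])
```

In the real system, the LLM call, scoring, and graph-of-operations bookkeeping are considerably richer; the point of the sketch is only the division of responsibilities among the four modules.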
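The decompose-and-merge pattern that the assessment highlights (splitting a task into subtasks solved individually, then merged, as in sorting) can be illustrated with a small sketch. The helper below is a plain-Python illustration of that pattern under the assumption of a branching factor k; it is not the paper's prompting pipeline, where each subtask would be solved by an LLM call.

```python
# Minimal illustration of the split/solve/merge pattern GoT exploits
# for sorting: decompose the input into k sublists (branching factor k),
# "solve" each independently, then merge the partial results.
from heapq import merge

def got_style_sort(values, k=4):
    # Decompose: split into k roughly equal sublists (ceil division).
    chunk = max(1, -(-len(values) // k))
    sublists = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # Solve each subtask individually (an LLM call in the real system).
    solved = [sorted(s) for s in sublists]
    # Merge the partial solutions into the final answer.
    return list(merge(*solved))

print(got_style_sort([9, 3, 7, 1, 8, 2, 6, 5]))  # → [1, 2, 3, 5, 6, 7, 8, 9]
```

Because each sublist is solved independently, errors stay local to a branch and the merge step can aggregate only the verified partial results, which is the property the table's quality and cost comparisons rest on.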