Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Authors: Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, Karthik Narasimhan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.
Researcher Affiliation | Collaboration | Shunyu Yao (Princeton University), Dian Yu (Google DeepMind), Jeffrey Zhao (Google DeepMind), Izhak Shafran (Google DeepMind), Thomas L. Griffiths (Princeton University), Yuan Cao (Google DeepMind), Karthik Narasimhan (Princeton University)
Pseudocode | Yes | Algorithm 1 ToT-BFS(x, pθ, G, k, V, T, b) and Algorithm 2 ToT-DFS(s, t, pθ, G, k, V, T, v_th); hedged sketches of both appear after this table.
Open Source Code | Yes | All code is available at https://github.com/princeton-nlp/tree-of-thought-llm.
Open Datasets | No | The paper mentions scraping data from websites such as 4nums.com, randomwordgenerator.com, and GooBix, but it does not provide concrete access information (specific links, DOIs, or formal citations) for the exact processed datasets used in the experiments.
Dataset Splits | No | The paper specifies the subsets used for testing in each task but does not give explicit training/validation/test splits, percentages, or absolute sample counts, nor does it mention cross-validation.
Hardware Specification | No | The paper states that experiments were performed using "Chat Completion mode GPT-4" but does not specify any hardware details such as GPU models, CPU types, or memory used to run these models or interact with their APIs.
Software Dependencies | No | The paper mentions using GPT-4 and GPT-3.5-turbo, but it does not provide specific version numbers for ancillary software libraries, frameworks, or development environments (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | "Unless otherwise stated, we perform experiments using a Chat Completion mode GPT-4 with a sampling temperature of 0.7." (Section 4) "We perform a breadth-first search (BFS) in ToT, where at each step we keep the best b = 5 candidates." (Section 4.1) "The LM first generates k = 5 plans and votes for the best one ... then similarly generate k = 5 passages ... Here the breadth limit b = 1 ... A simple zero-shot vote prompt ... is used to sample 5 votes ..." (Section 4.2) "We limit DFS search steps to 100 ..." (Section 4.3)
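The following is a minimal Python sketch of Algorithm 1 (ToT-BFS) as described in the paper, using the default parameters reported in Section 4 (b = 5, k = 5 for Game of 24). The helpers `generate_thoughts` and `evaluate_states` are hypothetical stand-ins for the thought generator G and the state evaluator V; they are not the function names used in the authors' repository.

```python
def tot_bfs(x, generate_thoughts, evaluate_states, k=5, T=3, b=5):
    """Breadth-first Tree of Thoughts search (sketch of Algorithm 1).

    x                 -- the problem input (a string)
    generate_thoughts -- G(s, k): propose k candidate next thoughts for state s
    evaluate_states   -- V(states): return a scalar value for each state
    k, T, b           -- proposals per state, step limit, breadth limit
                         (the paper uses b = 5 and k = 5 for Game of 24)
    """
    frontier = [x]  # S_0 = {x}
    for _ in range(T):
        # Expand every frontier state with k candidate thoughts.
        candidates = [s + "\n" + z
                      for s in frontier
                      for z in generate_thoughts(s, k)]
        # Score all candidates and keep the b highest-valued states.
        values = evaluate_states(candidates)
        ranked = sorted(zip(candidates, values),
                        key=lambda sv: sv[1], reverse=True)
        frontier = [s for s, _ in ranked[:b]]
    # Generate a final answer from the best remaining state
    # (frontier[0] is the argmax from the last ranking step).
    return generate_thoughts(frontier[0], 1)[0]
```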
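A corresponding sketch of Algorithm 2 (ToT-DFS), again under the same assumed helpers. The `budget` counter reflects the paper's 100-step limit for Mini Crosswords, and the numeric threshold `v_th` is an illustrative simplification of the paper's value-based pruning of unpromising states.

```python
def tot_dfs(s, t, generate_thoughts, evaluate_states, k, T, v_th,
            outputs, budget):
    """Depth-first Tree of Thoughts search with value-based pruning.

    budget -- single-element list used as a mutable step counter
              (the paper caps Mini Crosswords search at 100 DFS steps)
    """
    if budget[0] <= 0:
        return
    budget[0] -= 1
    if t > T:
        # Depth limit reached: record a final generation from this state.
        outputs.append(generate_thoughts(s, 1)[0])
        return
    # Propose k successors, visit them in decreasing order of value,
    # and prune any child whose value does not exceed v_th.
    children = generate_thoughts(s, k)
    values = evaluate_states(children)
    for child, value in sorted(zip(children, values),
                               key=lambda cv: cv[1], reverse=True):
        if value > v_th:
            tot_dfs(child, t + 1, generate_thoughts, evaluate_states,
                    k, T, v_th, outputs, budget)


# Example call (all arguments illustrative):
# outputs = []
# tot_dfs(x, 1, generate_thoughts, evaluate_states,
#         k=5, T=10, v_th=0.0, outputs=outputs, budget=[100])
```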