Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
Authors: Sijia Chen, Baochun Li, Di Niu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to other advanced prompting approaches. The problem-solving rates indicate that BoT, employing binary tree thought structures, significantly surpasses the current state of the art on GSM8K and AQuA while achieving the second-best results on the other datasets. |
| Researcher Affiliation | Academia | Sijia Chen, Baochun Li (Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada; sjia.chen@mail.utoronto.ca, bli@ece.toronto.edu); Di Niu (Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada; dniu@ualberta.ca) |
| Pseudocode | Yes | Algorithm 1: Main reasoning pipeline of BoT. Input: number of iterations T, number of tree structures M, question Q. Output: aggregated chain z^T_{1..n}. Algorithm 2: Best-First Aggregation and Greedy Aggregation. (A minimal sketch of the Algorithm 1 loop appears below this table.) |
| Open Source Code | Yes | The source code is available under the folder examples/BoTReasoning of https://github.com/iQua/llmpebase. |
| Open Datasets | Yes | Experiments are performed on benchmark datasets with diverse mathematical problems, including MMLU (Hendrycks et al., 2021a), SVAMP (Patel et al., 2021), GSM8K (Cobbe et al., 2021), AQuA (Ling et al., 2017) and MATH (Hendrycks et al., 2021b). Besides, we include a challenging mathematical reasoning task, Game of 24 (Yao et al., 2024), where the goal is to use four numbers and basic arithmetic operations (addition, subtraction, multiplication, and division) to obtain 24 in one equation. (A brute-force checker illustrating this task appears after the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test splits (e.g., percentages or exact counts) for the datasets used. |
| Hardware Specification | No | The paper mentions using GPT-4 via OpenAI APIs and Llama2 (llama2-13b-chat) locally, but it does not specify any particular hardware components like CPU/GPU models or memory details used for the local experiments. |
| Software Dependencies | No | The paper mentions GPT-4 and Llama2 models but does not provide specific version numbers for any software dependencies or libraries (e.g., Python version, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | If not explicitly stated, BoT, in all experiments, performs T = 10 iterations and builds M = 15 thought structures, each being a weighted binary tree, because this tends to achieve optimal results. Besides, for the benchmark datasets we set the depth of the tree to 5, while the corresponding depth in Game of 24 is 3. To construct the heterogeneous tree thought structures, BoT randomly chooses the temperature from the range [0.2, 0.4, 0.6, 0.7, 0.9, 1.1, 1.5] and the top-p from the range [0.1, 0.3, 0.5, 0.7, 0.9]. (A sketch of this per-tree sampling appears at the end of the section.) |
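The pseudocode row above only names the inputs and outputs of Algorithm 1. A minimal Python sketch of that trial-and-error loop is given below; it assumes caller-supplied callables for building thought structures, aggregating chains, and analyzing errors, and none of these names (`boosting_of_thoughts`, `build_chain`, `aggregate`, `analyze`) come from the llmpebase code.

```python
def boosting_of_thoughts(question, build_chain, aggregate, analyze, T=10, M=15):
    """Hypothetical sketch of Algorithm 1: T boosting iterations over M trees.

    build_chain(prompt) -> a reasoning chain extracted from one thought tree,
    aggregate(chains)   -> one chain (e.g. best-first or greedy aggregation),
    analyze(chain)      -> LLM-produced error analysis and advice (a string).
    """
    prompt = question          # the first iteration starts without experience
    best_chain = None
    for _ in range(T):
        # Build M heterogeneous thought structures; each one can use its own
        # randomly sampled temperature / top-p (see the setup sketch below).
        chains = [build_chain(prompt) for _ in range(M)]
        best_chain = aggregate(chains)
        feedback = analyze(best_chain)
        # Trial-and-error boosting: fold the error analysis back into the
        # prompt so the next iteration reasons from accumulated experience.
        prompt = f"{prompt}\n\nExperience from the previous attempt:\n{feedback}"
    return best_chain
```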
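The Game of 24 task listed in the datasets row can be made concrete with a small brute-force checker: it tests whether four numbers can be combined with +, -, *, and / into an expression equal to 24. This is purely illustrative of the task definition and is not part of the paper's evaluation code.

```python
from itertools import permutations

def solvable_24(numbers, target=24.0, eps=1e-6):
    """Return True if the four numbers can reach `target` with +, -, *, /."""
    def search(vals):
        if len(vals) == 1:
            return abs(vals[0] - target) < eps
        # Pick any ordered pair, combine it with each operation, and recurse
        # on the reduced list of values.
        for a, b in permutations(vals, 2):
            rest = list(vals)
            rest.remove(a)
            rest.remove(b)
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:
                candidates.append(a / b)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False
    return search([float(n) for n in numbers])

print(solvable_24([4, 9, 10, 13]))  # True: (10 - 4) * (13 - 9) = 24
```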
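Finally, the experiment-setup row states the values BoT samples from when constructing heterogeneous trees. A small sketch of that per-tree sampling is shown below; the function name `sample_tree_config` and the dictionary layout are assumptions for illustration, not the configuration schema used in llmpebase.

```python
import random

# Candidate sampling settings and tree depths quoted in the setup row above.
TEMPERATURES = [0.2, 0.4, 0.6, 0.7, 0.9, 1.1, 1.5]
TOP_PS = [0.1, 0.3, 0.5, 0.7, 0.9]

def sample_tree_config(task="benchmark"):
    """Draw one heterogeneous tree configuration (a weighted binary tree)."""
    return {
        "growth": "binary",                        # each node expands to two children
        "depth": 3 if task == "game_of_24" else 5, # depth 3 for Game of 24, else 5
        "temperature": random.choice(TEMPERATURES),
        "top_p": random.choice(TOP_PS),
    }

# BoT builds M = 15 such trees per iteration and runs T = 10 iterations.
configs = [sample_tree_config() for _ in range(15)]
```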