Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
Authors: Sijia Chen, Baochun Li, Di Niu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to other advanced prompting approaches. The problem-solving rates indicate that BoT, employing binary tree thought structures, significantly surpasses the current state of the art on GSM8K and AQuA while achieving the second-best results on the other datasets. |
| Researcher Affiliation | Academia | Sijia Chen, Baochun Li (Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada; sjia.chen@mail.utoronto.ca, bli@ece.toronto.edu); Di Niu (Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada; dniu@ualberta.ca) |
| Pseudocode | Yes | Algorithm 1: Main reasoning pipeline of BoT. Input: number of iterations T, number of tree structures M, question Q. Output: aggregated chain z^T_{1..n}. Algorithm 2: Best-First Aggregation and Greedy Aggregation. (A minimal sketch of the Algorithm 1 loop appears below this table.) |
| Open Source Code | Yes | The source code is available under the folder examples/BoTReasoning of https://github.com/iQua/llmpebase. |
| Open Datasets | Yes | Experiments are performed on benchmark datasets with diverse mathematical problems, including MMLU (Hendrycks et al., 2021a), SVAMP (Patel et al., 2021), GSM8K (Cobbe et al., 2021), AQuA (Ling et al., 2017) and MATH (Hendrycks et al., 2021b). Besides, we include a challenging mathematical reasoning task, Game of 24 (Yao et al., 2024), where the goal is to use four numbers and basic arithmetic operations (addition, subtraction, multiplication, and division) to obtain 24 in one equation. (A brute-force checker illustrating this task appears after the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test splits (e.g., percentages or exact counts) for the datasets used. |
| Hardware Specification | No | The paper mentions using GPT-4 via OpenAI APIs and Llama2 (llama2-13b-chat) locally, but it does not specify any particular hardware components like CPU/GPU models or memory details used for the local experiments. |
| Software Dependencies | No | The paper mentions GPT-4 and Llama2 models but does not provide specific version numbers for any software dependencies or libraries (e.g., Python version, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | If not explicitly stated, BoT, in all experiments, performs T = 10 iterations and builds M = 15 thought structures, each being a weighted binary tree, because this tends to achieve optimal results. Besides, for the benchmark datasets we set the depth of the tree to 5, while the corresponding depth in Game of 24 is 3. To construct the heterogeneous tree thought structures, BoT randomly chooses the temperature from the range [0.2, 0.4, 0.6, 0.7, 0.9, 1.1, 1.5] and the top-p from the range [0.1, 0.3, 0.5, 0.7, 0.9]. (A sketch of this per-tree sampling appears at the end of the section.) |
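The pseudocode row above only names the inputs and outputs of Algorithm 1. A minimal Python sketch of that trial-and-error loop is given below; it assumes caller-supplied callables for building thought structures, aggregating chains, and analyzing errors, and none of these names (`boosting_of_thoughts`, `build_chain`, `aggregate`, `analyze`) come from the llmpebase code.

```python
def boosting_of_thoughts(question, build_chain, aggregate, analyze, T=10, M=15):
    """Hypothetical sketch of Algorithm 1: T boosting iterations over M trees.

    build_chain(prompt) -> a reasoning chain extracted from one thought tree,
    aggregate(chains)   -> one chain (e.g. best-first or greedy aggregation),
    analyze(chain)      -> LLM-produced error analysis and advice (a string).
    """
    prompt = question          # the first iteration starts without experience
    best_chain = None
    for _ in range(T):
        # Build M heterogeneous thought structures; each one can use its own
        # randomly sampled temperature / top-p (see the setup sketch below).
        chains = [build_chain(prompt) for _ in range(M)]
        best_chain = aggregate(chains)
        feedback = analyze(best_chain)
        # Trial-and-error boosting: fold the error analysis back into the
        # prompt so the next iteration reasons from accumulated experience.
        prompt = f"{prompt}\n\nExperience from the previous attempt:\n{feedback}"
    return best_chain
```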
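The Game of 24 task listed in the datasets row can be made concrete with a small brute-force checker: it tests whether four numbers can be combined with +, -, *, and / into an expression equal to 24. This is purely illustrative of the task definition and is not part of the paper's evaluation code.

```python
from itertools import permutations

def solvable_24(numbers, target=24.0, eps=1e-6):
    """Return True if the four numbers can reach `target` with +, -, *, /."""
    def search(vals):
        if len(vals) == 1:
            return abs(vals[0] - target) < eps
        # Pick any ordered pair, combine it with each operation, and recurse
        # on the reduced list of values.
        for a, b in permutations(vals, 2):
            rest = list(vals)
            rest.remove(a)
            rest.remove(b)
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:
                candidates.append(a / b)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False
    return search([float(n) for n in numbers])

print(solvable_24([4, 9, 10, 13]))  # True: (10 - 4) * (13 - 9) = 24
```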
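Finally, the experiment-setup row states the values BoT samples from when constructing heterogeneous trees. A small sketch of that per-tree sampling is shown below; the function name `sample_tree_config` and the dictionary layout are assumptions for illustration, not the configuration schema used in llmpebase.

```python
import random

# Candidate sampling settings and tree depths quoted in the setup row above.
TEMPERATURES = [0.2, 0.4, 0.6, 0.7, 0.9, 1.1, 1.5]
TOP_PS = [0.1, 0.3, 0.5, 0.7, 0.9]

def sample_tree_config(task="benchmark"):
    """Draw one heterogeneous tree configuration (a weighted binary tree)."""
    return {
        "growth": "binary",                        # each node expands to two children
        "depth": 3 if task == "game_of_24" else 5, # depth 3 for Game of 24, else 5
        "temperature": random.choice(TEMPERATURES),
        "top_p": random.choice(TOP_PS),
    }

# BoT builds M = 15 such trees per iteration and runs T = 10 iterations.
configs = [sample_tree_config() for _ in range(15)]
```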