Toward Adaptive Reasoning in Large Language Models with Thought Rollback

Authors: Sijia Chen, Baochun Li

ICML 2024

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — Comprehensive experiments on mathematical problems and multi-task reasoning demonstrate the state-of-the-art performance of TR in terms of problem-solving rate and interaction cost. For instance, the solving rate of GPT-4 with TR outperforms the current best by 9% on the MATH dataset. We conduct experiments on two streams of tasks. For the mathematical problems, we evaluate the performance of TR on the test sets of the GSM8K (1319; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b) datasets, where the numbers in parentheses indicate sample sizes.
Researcher Affiliation: Academia — Sijia Chen and Baochun Li, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada. Correspondence to: Sijia Chen <sjia.chen@mail.utoronto.ca>.
Pseudocode: No — The paper describes the framework components and processes in prose and uses figures for illustration, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code: Yes — The source code is available under the folder examples/ThoughtRollback of https://github.com/iQua/llmpebase.
Open Datasets: Yes — For the mathematical problems, we evaluate the performance of TR on the test sets of the GSM8K (1319; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b) datasets, where the numbers in parentheses indicate sample sizes.
Dataset Splits: No — The paper mentions evaluating on "test sets" and extracting CoT examples from the "trainset," but it does not provide the specific train/validation/test split details (e.g., percentages, sample counts, or an explicit splitting methodology) needed for reproducibility.
Hardware Specification: No — The paper mentions using GPT-3.5-turbo, GPT-4, and Llama 2 models, but it does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No — The code is written in Python and imports datasets from Hugging Face to build PyTorch data loaders. However, specific version numbers for Python, PyTorch, the Hugging Face libraries, or any other software dependencies are not provided.
Experiment Setup: Yes — For LLMs with TR, the default settings for temperature and top-p are 0.7.
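The reported decoding defaults (temperature 0.7, top-p 0.7) can be expressed as an OpenAI-style chat-completion request payload. This is a minimal sketch: the helper function, model name, and prompt are hypothetical, not from the paper's code.

```python
# Hypothetical sketch: build request parameters carrying the decoding
# defaults reported for TR (temperature = 0.7, top_p = 0.7).
def build_request(prompt, model="gpt-3.5-turbo"):
    return {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # default reported in the paper
        "top_p": 0.7,        # default reported in the paper
    }


# Usage: pass the dict as keyword arguments to a chat-completion client,
# e.g. client.chat.completions.create(**build_request("...")).
```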