Toward Adaptive Reasoning in Large Language Models with Thought Rollback
Authors: Sijia Chen, Baochun Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on mathematical problems and multi-task reasoning demonstrate the state-of-the-art performance of TR in terms of problem-solving rate and interaction cost. For instance, the solving rate of GPT-4 with TR outperforms the current best by 9% on the MATH dataset. We conduct experiments on two streams of tasks. For the mathematical problems, we evaluate the performance of TR on the test sets of GSM8K (1,319 samples; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b), where the parenthetical numbers are test sample counts. |
| Researcher Affiliation | Academia | Sijia Chen¹, Baochun Li¹. ¹Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada. Correspondence to: Sijia Chen <sjia.chen@mail.utoronto.ca>. |
| Pseudocode | No | The paper describes the framework components and processes in prose and uses figures for illustration, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The source code is available under the folder examples/ThoughtRollback of https://github.com/iQua/llmpebase. |
| Open Datasets | Yes | For the mathematical problems, we evaluate the performance of TR on the test sets of GSM8K (1,319 samples; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b), where the parenthetical numbers are test sample counts. |
| Dataset Splits | No | The paper mentions evaluating on 'test sets' and extracting CoT examples from the 'trainset' but does not provide specific details on the train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for reproducibility. |
| Hardware Specification | No | The paper mentions using 'GPT-3.5-turbo', 'GPT-4', and 'Llama2' models, but it does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The code is written in Python and imports the datasets from Hugging Face to build PyTorch's data loader (an illustrative loading sketch follows the table). However, specific version numbers for Python, PyTorch, Hugging Face libraries, or any other software dependencies are not provided. |
| Experiment Setup | Yes | For LLMs with TR, the default settings for temperature and top_p are 0.7 (an illustrative API call with these settings follows the table). |
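
The Software Dependencies row notes that the released code imports datasets from Hugging Face and builds a PyTorch data loader, without pinning versions. As a hedged illustration only (not the authors' code; the dataset name, config, and field names are assumptions based on the public GSM8K release), one of the evaluated test sets could be loaded along these lines:

```python
# Illustrative sketch only: load the public "gsm8k" dataset from Hugging Face,
# whose "main" config exposes a 1,319-sample test split with "question"/"answer"
# fields, and wrap it in a PyTorch DataLoader.
from datasets import load_dataset
from torch.utils.data import DataLoader

gsm8k_test = load_dataset("gsm8k", "main", split="test")

def collate(batch):
    # Keep raw question/answer strings; prompt construction happens elsewhere.
    questions = [example["question"] for example in batch]
    answers = [example["answer"] for example in batch]
    return questions, answers

loader = DataLoader(gsm8k_test, batch_size=1, shuffle=False, collate_fn=collate)

for questions, answers in loader:
    print(questions[0])
    break
```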
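
The Experiment Setup row pins only the sampling parameters (temperature and top_p of 0.7). A minimal sketch of a query using those settings with the current OpenAI Python client is shown below; the model name, prompt, and client usage are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical query showing temperature = 0.7 and top_p = 0.7, the only
# generation settings reported in the paper; everything else is an assumption.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Answer the question step by step: ..."}],
    temperature=0.7,
    top_p=0.7,
)
print(response.choices[0].message.content)
```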