Toward Adaptive Reasoning in Large Language Models with Thought Rollback

Authors: Sijia Chen, Baochun Li

ICML 2024

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — Comprehensive experiments on mathematical problems and multi-task reasoning demonstrate the state-of-the-art performance of TR in terms of problem-solving rate and interaction cost. For instance, the solving rate of GPT-4 with TR outperforms the current best by 9% on the MATH dataset. We conduct experiments on two streams of tasks. For the mathematical problems, we evaluate the performance of TR on the test sets of the GSM8K (1319; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b) datasets, where the numbers in parentheses indicate sample sizes.
Researcher Affiliation: Academia — Sijia Chen and Baochun Li, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada. Correspondence to: Sijia Chen <sjia.chen@mail.utoronto.ca>.
Pseudocode: No — The paper describes the framework components and processes in prose and uses figures for illustration, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code: Yes — The source code is available under the folder examples/ThoughtRollback of https://github.com/iQua/llmpebase.
Open Datasets: Yes — For the mathematical problems, we evaluate the performance of TR on the test sets of the GSM8K (1319; Cobbe et al., 2021), SVAMP (300; Patel et al., 2021), AQUA-RAT (254; Ling et al., 2017), MATH (900; Hendrycks et al., 2021b), and TheoremQA (400; Chen et al., 2023b) datasets, where the numbers in parentheses indicate sample sizes.
Dataset Splits: No — The paper mentions evaluating on "test sets" and extracting CoT examples from the "trainset," but it does not provide the specific train/validation/test split details (e.g., percentages, sample counts, or an explicit splitting methodology) needed for reproducibility.
Hardware Specification: No — The paper mentions using GPT-3.5-turbo, GPT-4, and Llama 2 models, but it does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No — The code is written in Python and imports datasets from Hugging Face to build PyTorch data loaders. However, specific version numbers for Python, PyTorch, the Hugging Face libraries, or any other software dependencies are not provided.
Experiment Setup: Yes — For LLMs with TR, the default settings for temperature and top-p are 0.7.
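The reported decoding defaults (temperature 0.7, top-p 0.7) can be expressed as an OpenAI-style chat-completion request payload. This is a minimal sketch: the helper function, model name, and prompt are hypothetical, not from the paper's code.

```python
# Hypothetical sketch: build request parameters carrying the decoding
# defaults reported for TR (temperature = 0.7, top_p = 0.7).
def build_request(prompt, model="gpt-3.5-turbo"):
    return {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # default reported in the paper
        "top_p": 0.7,        # default reported in the paper
    }


# Usage: pass the dict as keyword arguments to a chat-completion client,
# e.g. client.chat.completions.create(**build_request("...")).
```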