reproducibilityindex.ai

Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle

Authors: Shangzi Xue, Zhenya Huang, Jiayu Liu, Xin Lin, Yuting Ning, Binbin Jin, Xin Li, Qi Liu

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on three reasoning benchmarks, including ScienceQA, StrategyQA, and GSM8K, which cover a variety of reasoning tasks, demonstrating that our approach significantly reduces logical errors and enhances performance across various LLMs.
Researcher Affiliation	Academia	1: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center {xueshangzi,jy251198,linx,ningyt,bb0725}@mail.ustc.edu.cn; {huangzhy,leexin,qiliuql}@ustc.edu.cn
Pseudocode	Yes	Algorithm 1 Decompose-Analyze-Rethink
Open Source Code	Yes	Our code is available at: https://github.com/ShangziXue/DeAR
Open Datasets	Yes	We employ the ScienceQA [28] dataset for the knowledge reasoning task. And we use StrategyQA [12] for logical reasoning that requires multiple reasoning steps. We also verify the mathematical reasoning ability of our framework by applying it to GSM8K dataset [8].
Dataset Splits	Yes	For each dataset, we randomly sample 10% of its training set as a validation set to select different combinations of thresholds ϵ1 and ϵ2.
Hardware Specification	No	The paper mentions using GPT-3.5, LLaMA2-7B, and ChatGLM3-6B as LLM backbones but does not specify the hardware (e.g., specific GPU models or CPU types) on which these models were run for their experiments.
Software Dependencies	No	The paper mentions using specific LLM backbones (GPT-3.5, LLaMA2-7B, and ChatGLM3-6B) and accessing the OpenAI API. However, it does not specify software dependencies such as Python versions, library versions (e.g., PyTorch, TensorFlow), or CUDA versions with numerical identifiers.
Experiment Setup	Yes	To ensure computational efficiency, we set the maximum depth to 4 and the maximum number of branches to 3 during the construction of the reasoning tree in DeAR. For each dataset, we randomly sample 10% of its training set as a validation set to select different combinations of thresholds ϵ1 and ϵ2.
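
The Dataset Splits and Experiment Setup rows describe a 10% validation hold-out used to pick the thresholds ϵ1 and ϵ2, plus fixed limits on the reasoning tree. The following is a minimal Python sketch of what such a setup could look like, not the authors' implementation. The names TreeLimits, split_validation, select_thresholds, and score_fn, the seed, and the use of an exhaustive grid over candidate thresholds are all assumptions made for illustration; the source states only the 10% sampling rate, the maximum depth of 4, and the maximum of 3 branches.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TreeLimits:
    """Hypothetical container for the tree limits quoted in the table."""
    max_depth: int = 4      # maximum depth reported in the Experiment Setup row
    max_branches: int = 3   # maximum number of branches reported in the same row


def split_validation(train, fraction=0.1, seed=0):
    """Randomly hold out `fraction` of the training set as a validation set.

    The seed is an assumption; the source does not report one.
    """
    rng = random.Random(seed)
    indices = list(range(len(train)))
    rng.shuffle(indices)
    n_val = int(len(train) * fraction)
    val = [train[i] for i in indices[:n_val]]
    rest = [train[i] for i in indices[n_val:]]
    return rest, val


def select_thresholds(val, eps1_grid, eps2_grid, score_fn):
    """Return the (eps1, eps2) pair with the highest validation score.

    `score_fn(val, eps1, eps2)` is a hypothetical callback that would run the
    reasoning pipeline on the validation set and return a number to maximize.
    """
    return max(product(eps1_grid, eps2_grid),
               key=lambda pair: score_fn(val, *pair))
```

In use, score_fn would wrap whatever evaluation the reader has available (for example, validation accuracy under a given threshold pair), and the candidate grids would be chosen by the reader, since the source does not list the values that were searched.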