Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle
Authors: Shangzi Xue, Zhenya Huang, Jiayu Liu, Xin Lin, Yuting Ning, Binbin Jin, Xin Li, Qi Liu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on three reasoning benchmarks, including ScienceQA, StrategyQA, and GSM8K, which cover a variety of reasoning tasks, demonstrating that our approach significantly reduces logical errors and enhances performance across various LLMs. |
| Researcher Affiliation | Academia | 1: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | Yes | Algorithm 1 Decompose-Analyze-Rethink |
| Open Source Code | Yes | Our code is available at: https://github.com/ShangziXue/DeAR |
| Open Datasets | Yes | We employ the ScienceQA [28] dataset for the knowledge reasoning task. And we use StrategyQA [12] for logical reasoning that requires multiple reasoning steps. We also verify the mathematical reasoning ability of our framework by applying it to GSM8K dataset [8]. |
| Dataset Splits | Yes | For each dataset, we randomly sample 10% of its training set as a validation set to select different combinations of thresholds ϵ1 and ϵ2. |
| Hardware Specification | No | The paper mentions using GPT-3.5, LLaMA2-7B, and ChatGLM3-6B as LLM backbones but does not specify the hardware (e.g., specific GPU models or CPU types) on which these models were run for their experiments. |
| Software Dependencies | No | The paper mentions specific LLM backbones (GPT-3.5, LLaMA2-7B, and ChatGLM3-6B) and access through the OpenAI API. However, it does not give versioned software dependencies such as the Python version, library versions (e.g., PyTorch, TensorFlow), or the CUDA version. |
| Experiment Setup | Yes | To ensure computational efficiency, we set the maximum depth to 4 and the maximum number of branches to 3 during the construction of the reasoning tree in DeAR. For each dataset, we randomly sample 10% of its training set as a validation set to select different combinations of thresholds ϵ1 and ϵ2. |
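
The setup quoted in the table (a random 10% validation split used to pick the thresholds ϵ1 and ϵ2, with tree depth capped at 4 and branching at 3) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the random seed, the candidate threshold grids, and the `score_fn` callback are all assumptions, since the quoted text does not specify them.

```python
import random
from itertools import product

# Limits reported in the quoted experiment setup.
MAX_DEPTH = 4      # maximum depth of the reasoning tree
MAX_BRANCHES = 3   # maximum number of branches per node


def split_validation(train_examples, fraction=0.10, seed=0):
    """Randomly hold out `fraction` of the training set as a validation set.

    The seed is an assumption; the quoted text does not report one.
    """
    rng = random.Random(seed)
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)
    n_val = int(round(len(indices) * fraction))
    val_idx = set(indices[:n_val])
    val = [ex for i, ex in enumerate(train_examples) if i in val_idx]
    train = [ex for i, ex in enumerate(train_examples) if i not in val_idx]
    return train, val


def select_thresholds(val_examples, score_fn, eps1_grid, eps2_grid):
    """Return the (eps1, eps2) pair with the highest validation score.

    `score_fn(val_examples, eps1, eps2)` is a hypothetical callback that
    would run the method on the validation set and return a score.
    """
    return max(
        product(eps1_grid, eps2_grid),
        key=lambda pair: score_fn(val_examples, pair[0], pair[1]),
    )
```

In use, `split_validation` would be called once per dataset, and `select_thresholds` would be given whatever candidate grids are being compared; the grids themselves are not stated in the quoted setup.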