SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures
Authors: Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu (Steven) Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns. |
| Researcher Affiliation | Collaboration | Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng — Google DeepMind; University of Southern California |
| Pseudocode | No | The paper includes figures illustrating the process using structured prompts (e.g., Figure 9 'Meta-Prompts for the three actions of SELF-DISCOVER'), but these are not explicitly labeled as 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | Due to legal constraints, we are not able to fully release the code and data at the current time. But we provide, with the best faith, details to reproduce our results. |
| Open Datasets | Yes | We test SELF-DISCOVER on 25 challenging reasoning tasks including BIG-Bench Hard (BBH) (Suzgun et al., 2022), Thinking for Doing (T4D) (Zhou et al., 2023) and MATH (Hendrycks et al., 2021). |
| Dataset Splits | No | The paper states 'we subsample 200 examples from the MATH test set' and mentions 'sampled 10 tasks with 50 examples each' for MMLU, but it does not specify explicit train/validation splits or percentages for any of the datasets, nor does it refer to predefined standard splits for all tasks. |
| Hardware Specification | No | The paper specifies the Large Language Models (LLMs) used (e.g., GPT-4, PaLM 2-L, Llama2-70B), but it does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) on which these models were run for inference or the experiments. |
| Software Dependencies | No | The paper cites papers describing the LLMs used but does not explicitly list software dependencies with specific version numbers (e.g., Python version, PyTorch/TensorFlow versions, or specific libraries with their versions) required for replication. |
| Experiment Setup | Yes | Specifically, we use three meta-prompts to guide LLMs to select, adapt, and implement an actionable reasoning structure with no labels or training required. We format the structure in key-value pairs similar to JSON due to interpretability and findings that following JSON boosts reasoning and generation quality (Zhou et al., 2023; OpenAI, 2023a). The structure of the meta-prompts and full prompts are shown in Figure 9. Stage 1 operates at the task level, meaning we only need to run SELF-DISCOVER once for each task. Then, in Stage 2, we can simply use the discovered reasoning structure to solve every instance of the given task by instructing models to follow the provided structure, filling in each key and arriving at a final answer. (A code sketch of this two-stage setup follows the table.) |
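
The Experiment Setup row describes a two-stage prompting pipeline: Stage 1 runs three task-level meta-prompts (SELECT, ADAPT, IMPLEMENT) once per task to produce a JSON-like reasoning structure, and Stage 2 reuses that structure on every instance. Below is a minimal, hedged Python sketch of that flow. The `call_llm` helper and the meta-prompt wording are placeholders paraphrased from the paper's description, not the exact Figure 9 prompts, and the `REASONING_MODULES` list shows only a few of the 39 seed modules the paper draws on.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM (e.g. GPT-4 or PaLM 2-L) and return its text reply."""
    raise NotImplementedError("Wire this to your model API of choice.")


# A few illustrative seed reasoning modules; the paper uses a set of 39.
REASONING_MODULES = [
    "How could I devise an experiment to help solve the problem?",
    "Use critical thinking to analyze the problem from different perspectives.",
    "Let's think step by step.",
]


def self_discover_structure(task_examples: list[str]) -> str:
    """Stage 1 (task-level, run once per task): SELECT, ADAPT, then IMPLEMENT."""
    examples = "\n\n".join(task_examples)
    modules = "\n".join(REASONING_MODULES)

    # SELECT: pick the reasoning modules relevant to this task.
    selected = call_llm(
        "Select several reasoning modules that are crucial to solving the task examples below.\n"
        f"Reasoning modules:\n{modules}\n\nTask examples:\n{examples}\n\nSelected modules:"
    )
    # ADAPT: rephrase the selected modules so they are specific to the task.
    adapted = call_llm(
        "Rephrase and specify each selected reasoning module so it better helps solve the task.\n"
        f"Selected modules:\n{selected}\n\nTask examples:\n{examples}\n\nAdapted modules:"
    )
    # IMPLEMENT: operationalize the adapted modules into a key-value reasoning plan.
    structure = call_llm(
        "Operationalize the adapted modules into a step-by-step reasoning plan, written as a "
        "JSON-like structure whose keys are steps with empty values to fill in.\n"
        f"Adapted modules:\n{adapted}\n\nTask examples:\n{examples}\n\nReasoning structure:"
    )
    return structure


def solve_instance(structure: str, instance: str) -> str:
    """Stage 2 (instance-level): follow the discovered structure to answer one example."""
    return call_llm(
        "Follow the step-by-step reasoning structure below, filling in each key, "
        "and then state the final answer.\n\n"
        f"Reasoning structure:\n{structure}\n\nTask:\n{instance}"
    )
```

Because Stage 1 is run once per task and only Stage 2 is run per instance, the per-example cost is a single forward pass, which is consistent with the paper's claim of needing far less inference compute than self-consistency-style methods.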