Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Decomposed Prompting: A Modular Approach for Solving Complex Tasks
Authors: Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate these advantages of DECOMP, we empirically evaluate it against prior work on eight challenging datasets using GPT3 models |
| Researcher Affiliation | Collaboration | Allen Institute for AI; Stony Brook University; University of Edinburgh |
| Pseudocode | Yes | Algorithm 1 A recursive reversal strategy that splits the sequence in half, reverses each half, and concatenates them. Runs in O(log n) calls to the LM where n is the number of items in the sequence. procedure SPLITREVERSE(x) |
| Open Source Code | Yes | Datasets, Code and Prompts available at https://github.com/allenai/DecomP. |
| Open Datasets | Yes | We use HotpotQA in the fullwiki setting where it comes with the associated Wikipedia corpus for open-domain QA. 2WikiMultihopQA and MuSiQue, however, are originally reading comprehension datasets. ... To turn these datasets into open-domain QA datasets, we create a corpora for each dataset by combining all the paragraphs in the train, dev and test questions. |
| Dataset Splits | Yes | We manually annotate CoTs and decompositions for 20 training set questions, and sample 3 prompts of 15 questions each for all approaches. The detailed prompts are given in the Appendix G. We evaluate on 300 held-out dev questions in each dataset. |
| Hardware Specification | No | The paper specifies the LLM models used (e.g., 'text-davinci-002 InstructGPT3 model', 'Codex (code-davinci-002) model', 'Flan-T5-Large', 'Flan-T5-XL', 'Flan-T5-XXL') but does not provide specific hardware details (like GPU models, CPU types, or memory) on which these models or the experiments were run. |
| Software Dependencies | No | The paper refers to specific LLM models (e.g., GPT3 text-davinci-002, Codex code-davinci-002, Flan-T5 family) but does not provide details on specific software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions, or other dependencies) required for replication. |
| Experiment Setup | Yes | For NoDecomp-Ctxt, we search K ∈ {6, 8, 10} for GPT3 models and K ∈ {2, 4, 6, 8} for Flan-T5-* models. For Decomp-Ctxt, we search K ∈ {2, 4, 6} for GPT3 and Flan-T5-* models. |
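
The recursive strategy quoted in the Pseudocode row (Algorithm 1, SPLITREVERSE) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `reverse_directly` is a hypothetical stand-in for the LM call that reverses a short sequence, and `max_direct_len` is an assumed cutoff below which the sequence is handed to that call without further splitting. The one detail worth noting is that the reversed halves must be concatenated in swapped order (second half first) for the result to be a full reversal.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reverse_directly(items: Sequence[T]) -> List[T]:
    """Hypothetical stand-in for an LM call that reverses a short sequence."""
    return list(reversed(items))


def split_reverse(
    items: Sequence[T],
    base_reverse: Callable[[Sequence[T]], List[T]] = reverse_directly,
    max_direct_len: int = 4,  # assumed cutoff, not taken from the paper
) -> List[T]:
    """Reverse `items` by splitting in half, reversing each half, and joining."""
    if max_direct_len < 1:
        raise ValueError("max_direct_len must be at least 1")

    # Base case: the sequence is short enough to reverse in a single call.
    if len(items) <= max_direct_len:
        return base_reverse(items)

    # Recursive case: split in half and reverse each half independently.
    mid = len(items) // 2
    left_reversed = split_reverse(items[:mid], base_reverse, max_direct_len)
    right_reversed = split_reverse(items[mid:], base_reverse, max_direct_len)

    # The reversed second half comes first in the final result.
    return right_reversed + left_reversed


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    print(split_reverse(words, max_direct_len=2))
```

Passing a different `base_reverse` callable (for example, a wrapper around a prompted model) would replace the stand-in without changing the recursion.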