Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

Authors: Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on numerical, symbolic, and algorithmic reasoning tasks, and show that it outperforms existing prompting techniques.
Researcher Affiliation | Collaboration | 1 Tsinghua University, 2 Microsoft Research Asia, 3 Microsoft Azure AI. Correspondence to: Minlie Huang <aihuang@tsinghua.edu.cn>.
Pseudocode | Yes | In our main experiments, we use PAL-style reasoning, i.e., reasoning chains are snippets of code, and answers are obtained by executing the code. Figure 1 (left) shows an example prompt for the backward process, which includes some demonstrations randomly sampled from the seed examples and the previously synthesized ones. The example code snippets provided (e.g., 'def solution():' blocks) serve as structured, code-like representations of the reasoning process. (A sketch of such a code-form reasoning chain follows the table.)
Open Source Code | No | The paper neither states unambiguously that code is released nor links to a source-code repository for the method.
Open Datasets | Yes | We experimented on seven datasets of different reasoning tasks. Examples are presented in Table 1. Numerical reasoning: (1) GSM8K (Cobbe et al., 2021) is a dataset of diverse grade school math word problems whose test set contains 1,319 problems... (3) SVAMP (Patel et al., 2021)... (4) ASDiv (Miao et al., 2020)... (5) Single Op (Koncel-Kedziorski et al., 2016)... Symbolic reasoning: the Colored Objects task from BIG-Bench Hard (Suzgun et al., 2022)... Algorithmic reasoning: the Repeat Copy task, also from BIG-Bench Hard... (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using 'seed examples' and evaluating on the test sets of established datasets, but it does not specify explicit train/validation/test splits (e.g., percentages or absolute counts) that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper names its backend LLMs (e.g., the 'text-davinci-003' version of InstructGPT, 'code-davinci-002', and 'cushman'), but it gives no details about the underlying hardware (e.g., GPU models, CPU types, or memory) used to serve these models or to run the experiments.
Software Dependencies | No | The paper mentions tools such as Sentence-BERT and the all-mpnet-base-v2 encoder, but it does not provide version numbers for these or for any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We used top-p sampling (Holtzman et al., 2020) for synthesis with temperature set to 0.7, and used greedy decoding for inference with temperature set to 0. All numerical reasoning datasets share one set of seed examples either randomly sampled from GSM8K (when the number of seeds is 2 or 4) or from Wei et al. (2022b) (when the number of seeds is 8)... Target complexities range from the lowest complexity of the seed examples to the highest one plus c; c was set to 4 for numerical reasoning and 2 on the other datasets. In forward synthesis, the number of reasoning chains sampled for each question was 3. The encoder used for clustering was all-mpnet-base-v2. (A demonstration-selection sketch follows the table.)
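
As noted in the Pseudocode row, the paper's reasoning chains are PAL-style 'def solution():' blocks whose execution yields the final answer. The following is a minimal illustrative sketch of that format; the word problem and the execution harness are hypothetical, not taken from the paper's prompts.

```python
# Illustrative PAL-style reasoning chain: the chain of thought is written as
# executable Python, and the answer is obtained by running the code.
# The word problem and harness below are hypothetical, not the paper's prompt.

PAL_DEMO = '''
def solution():
    """Q: A baker makes 12 muffins per tray and bakes 5 trays.
    She sells 48 muffins. How many muffins are left?"""
    muffins_per_tray = 12
    trays = 5
    total = muffins_per_tray * trays
    sold = 48
    remaining = total - sold
    return remaining
'''

def execute_chain(code: str):
    """Run a generated 'def solution():' block and return its result."""
    namespace = {}
    exec(code, namespace)           # define solution() in an isolated namespace
    return namespace["solution"]()  # the executed code, not the LLM, produces the answer

print(execute_chain(PAL_DEMO))  # -> 12
```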
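
The evaluation datasets listed in the Open Datasets row are publicly available. As a hedged example, GSM8K can be obtained via the Hugging Face `datasets` library; the paper does not say how the authors loaded the data, and the identifier below is simply the public Hub name.

```python
# Hedged sketch: one way to obtain the GSM8K evaluation data referenced above.
# The paper itself does not specify a loading mechanism.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")  # configs: "main" or "socratic"
test_set = gsm8k["test"]               # 1,319 grade school math word problems

example = test_set[0]
print(example["question"])             # natural-language problem statement
print(example["answer"])               # reference rationale ending in "#### <answer>"
```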
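
The Experiment Setup row mentions clustering synthesized examples with the all-mpnet-base-v2 encoder and selecting demonstrations by complexity. Below is a minimal sketch of such in-cluster, complexity-based selection, assuming sentence-transformers and scikit-learn; the complexity measure (non-empty lines of the code-form reasoning chain) and the helper names are illustrative rather than the paper's exact implementation.

```python
# Minimal sketch of in-cluster, complexity-based demonstration selection,
# assuming sentence-transformers (all-mpnet-base-v2) and scikit-learn.
# Helper names and the complexity measure are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_demonstrations(examples, num_demos=8, seed=0):
    """examples: list of dicts with 'question' and 'reasoning' (a code string)."""
    encoder = SentenceTransformer("all-mpnet-base-v2")
    embeddings = encoder.encode([ex["question"] for ex in examples])

    # Partition the synthesized pool into as many clusters as demonstrations needed.
    labels = KMeans(n_clusters=num_demos, random_state=seed, n_init=10).fit_predict(embeddings)

    def complexity(ex):
        # Assumed proxy: number of non-empty lines in the code-form reasoning chain.
        return len([line for line in ex["reasoning"].splitlines() if line.strip()])

    # From each cluster, keep the most complex example as its representative.
    selected = []
    for cluster_id in range(num_demos):
        members = [ex for ex, label in zip(examples, labels) if label == cluster_id]
        if members:
            selected.append(max(members, key=complexity))
    return selected
```

Clustering into as many groups as demonstrations encourages diversity across the selected prompt, while taking the most complex member of each cluster reflects the complexity-based criterion described in the setup.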