Deductive Verification of Chain-of-Thought Reasoning

Authors: Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our natural program-based verification approach across a range of arithmetic and commonsense datasets on publicly available models like OpenAI's GPT-3.5-turbo. In this section, we perform evaluations to demonstrate the effectiveness of our Natural Program-based deductive reasoning verification approach over diverse reasoning datasets.
Researcher Affiliation | Collaboration | Zhan Ling¹, Yunhao Fang¹, Xuanlin Li¹, Zhiao Huang¹, Mingu Lee², Roland Memisevic², Hao Su¹ (¹UC San Diego; ²Qualcomm AI Research)
Pseudocode | No | The paper describes the 'Natural Program' format through examples but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
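Since the paper conveys the format only through worked examples, the following is a paraphrased sketch of what a Natural-Program-style step-verification prompt looks like. The premise numbering (#1, #2, ...) and the "(by #i, #j)" grounding tags follow the paper's examples, but the wording here is an illustration, not the authors' verbatim prompt.

```python
# Paraphrased illustration of a Natural-Program-style verification prompt.
# The exact wording is approximate; see the authors' repository for the
# actual prompts used in the paper.
STEP_VERIFICATION_PROMPT = """\
Here is some information:
#1. Alice starts with 3 apples.
#2. Bob gives Alice 2 more apples.

Based on the information above, here is a reasoning step:
#3. (by #1, #2) Alice now has 3 + 2 = 5 apples.

Double-check the reasoning step: is it grounded only in the premises it
cites, and is its deduction correct? Answer "yes" or "no".
"""
```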
Open Source Code | No | Code will be released at https://github.com/lz1oceani/verify_cot.
Open Datasets | Yes | For arithmetic reasoning, we utilize the following benchmarks: 1) AddSub [19]; 2) GSM8K [10]; 3) MATH [17]; 4) AQuA [24]. ... For symbol manipulation, we use Last Letter Concatenation [50]... For date understanding, we use the one from BIG-bench [45]
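For readers attempting reproduction, most of the cited benchmarks are publicly hosted. Below is a minimal loading sketch assuming current Hugging Face hub identifiers; the paper itself does not specify hub IDs, and MATH or the BIG-bench date understanding task may require different identifiers.

```python
# Minimal sketch for loading two of the cited benchmarks. The hub IDs
# ("gsm8k", "aqua_rat") are assumptions about current hosting, not
# identifiers taken from the paper.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # grade-school arithmetic word problems
aqua = load_dataset("aqua_rat", "raw")  # multiple-choice algebra (AQuA)

print(gsm8k["test"][0]["question"])
print(aqua["test"][0]["question"])
```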
Dataset Splits | No | The paper evaluates on standard benchmarks such as GSM8K and MATH but does not explicitly provide train/validation/test split percentages or counts for the main GPT-3.5 experiments. For the Vicuna fine-tuning, a generated dataset of 2,000 reasoning steps is used, but its train/validation/test splits are not specified.
Hardware Specification | Yes | Models were fine-tuned with 4 A100-80GB over 3 epochs.
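Combining this hardware note with the hyperparameters reported under Experiment Setup below, here is a hedged reconstruction of the Vicuna fine-tuning configuration in Hugging Face TrainingArguments terms. The paper does not state which training library was used, so the mapping, the per-device batch split across 4 GPUs, and the precision choice are assumptions.

```python
# Hedged reconstruction of the reported fine-tuning setup (AdamW, lr 1e-5,
# weight decay 0, 3 epochs, global batch size 64, linear schedule) on
# 4x A100-80GB. The library choice and the output_dir name are assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="vicuna-deductive-verifier",  # hypothetical name
    num_train_epochs=3,
    learning_rate=1e-5,
    weight_decay=0.0,
    per_device_train_batch_size=16,          # 16 per GPU x 4 GPUs = 64 global
    lr_scheduler_type="linear",
    optim="adamw_torch",
    bf16=True,                               # assumption: bfloat16 on A100s
)
```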
Software Dependencies | No | The paper mentions models like GPT-3.5-turbo and Vicuna, and the AdamW optimizer, but does not provide specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | For ChatGPT, we use a generation temperature of T = 0.7. For Unanimity-Plurality Voting, we set k = 10 and k′ = 3 by default. We use 1-shot prompting for both reasoning chain generation and deductive verification (except reasoning chain generation for the date understanding task, where we use 2-shot). Hyperparameters: optimizer AdamW, learning rate 1e-5, weight decay 0.00, num epochs 3, batch size 64, learning rate schedule linear.
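To make the reported k = 10 / k′ = 3 defaults concrete, here is a minimal sketch of Unanimity-Plurality Voting under these settings. `sample_chain` and `verify_step` are hypothetical stand-ins for the paper's ChatGPT calls (a chain is assumed to expose `.steps` and `.answer`), and the paper's fallback behavior when no chain passes verification is not reproduced here.

```python
# Minimal sketch of Unanimity-Plurality Voting with the reported defaults:
# k = 10 candidate chains, k' = 3 verification votes per step, T = 0.7.
# `sample_chain` and `verify_step` are hypothetical LLM-call stand-ins.
from collections import Counter

def unanimity_plurality_vote(question, sample_chain, verify_step, k=10, k_prime=3):
    valid_answers = []
    for _ in range(k):
        chain = sample_chain(question, temperature=0.7)
        # Unanimity: every step must win a majority of its k' validity votes.
        chain_valid = all(
            2 * sum(verify_step(question, step) for _ in range(k_prime)) > k_prime
            for step in chain.steps
        )
        if chain_valid:
            valid_answers.append(chain.answer)
    if not valid_answers:
        return None  # the paper's fallback for this case is not modeled here
    # Plurality: majority vote over the final answers of verified chains.
    return Counter(valid_answers).most_common(1)[0][0]
```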