Deductive Verification of Chain-of-Thought Reasoning
Authors: Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our natural program-based verification approach across a range of arithmetic and commonsense datasets on publicly available models like OpenAI's GPT-3.5-turbo. In this section, we perform evaluations to demonstrate the effectiveness of our Natural Program-based deductive reasoning verification approach over diverse reasoning datasets. |
| Researcher Affiliation | Collaboration | Zhan Ling1 Yunhao Fang1 Xuanlin Li1 Zhiao Huang1 Mingu Lee2 Roland Memisevic2 Hao Su1 1UC San Diego, 2Qualcomm AI Research |
| Pseudocode | No | The paper describes the 'Natural Program' format through examples but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | Code will be released at https://github.com/lz1oceani/verify_cot. |
| Open Datasets | Yes | For arithmetic reasoning, we utilize the following benchmarks: 1) AddSub [19]; 2) GSM8K [10]; 3) MATH [17]; 4) AQuA [24]. ... For symbol manipulation, we use Last Letter Concatenation [50]... For date understanding, we use the one from BIG-bench [45] |
| Dataset Splits | No | The paper uses standard benchmarks like GSM8K and MATH but does not explicitly provide training, validation, or test split percentages or counts for these datasets for the main GPT-3.5 experiments. For the Vicuna fine-tuning, a generated dataset of 2000 reasoning steps is used, but its train/validation/test splits are not specified. |
| Hardware Specification | Yes | Models were fine-tuned with 4 A100-80GB over 3 epochs. |
| Software Dependencies | No | The paper mentions models like GPT-3.5-turbo and Vicuna, and the AdamW optimizer, but does not provide specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | For ChatGPT, we use a generation temperature of T = 0.7. For Unanimity-Plurality Voting, we set k = 10 and k′ = 3 by default. We use 1-shot prompting for both reasoning chain generation and deductive verification (except reasoning chain generation for the date understanding task, where we use 2-shot). Hyperparameters: Optimizer AdamW, Learning rate 1e-5, Weight decay 0.00, Num epochs 3, Batch size 64, Learning rate schedule Linear. |
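The Unanimity-Plurality Voting setting reported above (k = 10 sampled reasoning chains, k′ = 3 verification votes per step) can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the `chains` structure and the `verify` callable are assumptions, and the aggregation follows the scheme described in the paper (a chain is kept only if all of its steps are unanimously judged valid, then a plurality vote over the surviving chains' final answers picks the output).

```python
from collections import Counter

def unanimity_plurality_vote(chains, verify, k_prime=3):
    """Sketch of Unanimity-Plurality Voting (hypothetical interface).

    chains:  list of (reasoning_steps, final_answer) pairs, i.e. the
             k sampled reasoning chains.
    verify:  callable(step) -> bool; sampled k' times per step, and a
             step passes if the majority of its k' votes say "valid".
    A chain survives only if *every* step passes (unanimity); the final
    answer is the plurality vote over surviving chains, falling back to
    all chains if none survive verification.
    """
    def step_valid(step):
        votes = [verify(step) for _ in range(k_prime)]
        return sum(votes) > k_prime / 2

    valid_chains = [(steps, ans) for steps, ans in chains
                    if all(step_valid(s) for s in steps)]
    pool = valid_chains or chains           # fall back if nothing survives
    answer_counts = Counter(ans for _, ans in pool)
    return answer_counts.most_common(1)[0][0]
```

In practice `verify` would itself be an LLM call that checks one reasoning step against its listed premises in the Natural Program format; here it is abstracted to a boolean predicate so the voting logic stands alone.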