Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Deductive Verification of Chain-of-Thought Reasoning
Authors: Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our natural program-based verification approach across a range of arithmetic and common sense datasets on publicly-available models like OpenAI's GPT-3.5-turbo. In this section, we perform evaluations to demonstrate the effectiveness of our Natural Program-based deductive reasoning verification approach over diverse reasoning datasets. |
| Researcher Affiliation | Collaboration | Zhan Ling¹, Yunhao Fang¹, Xuanlin Li¹, Zhiao Huang¹, Mingu Lee², Roland Memisevic², Hao Su¹; ¹UC San Diego, ²Qualcomm AI Research |
| Pseudocode | No | The paper describes the 'Natural Program' format through examples but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | Code will be released at https://github.com/lz1oceani/verify_cot. |
| Open Datasets | Yes | For arithmetic reasoning, we utilize the following benchmarks: 1) AddSub [19]; 2) GSM8K [10]; 3) MATH [17]; 4) AQuA [24]. ... For symbol manipulation, we use Last Letter Concatenation [50]... For date understanding, we use the one from BIG-bench [45] |
| Dataset Splits | No | The paper uses standard benchmarks like GSM8K and MATH but does not explicitly provide training, validation, or test split percentages or counts for these datasets for the main GPT-3.5 experiments. For the Vicuna fine-tuning, a generated dataset of 2000 reasoning steps is used, but its train/validation/test splits are not specified. |
| Hardware Specification | Yes | Models were fine-tuned with 4 A100-80GB over 3 epochs. |
| Software Dependencies | No | The paper mentions models like GPT-3.5-turbo and Vicuna, and the AdamW optimizer, but does not provide specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | For ChatGPT, we use a generation temperature of T = 0.7. For Unanimity-Plurality Voting, we set k = 10 and k' = 3 by default. We use 1-shot prompting for both reasoning chain generation and deductive verification (except reasoning chain generation for the date understanding task where we use 2-shot). Hyperparameters: Optimizer AdamW, Learning rate 1e-5, Weight decay 0.00, Num epochs 3, Batch size 64, Learning rate schedule Linear. |
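
The fine-tuning hyperparameters in the Experiment Setup row can be gathered into a small configuration object, sketched below. This is a minimal illustration and not the authors' code: the names `FineTuneConfig`, `linear_lr`, and `steps_per_epoch` are hypothetical, and the schedule assumes a plain linear decay to zero with no warmup, which the table does not specify. The step count further assumes that all 2000 generated reasoning steps are used for training, since no split is reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values copied from the Experiment Setup row of the table.
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    weight_decay: float = 0.0
    num_epochs: int = 3
    batch_size: int = 64
    lr_schedule: str = "linear"


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, rounding up for a final partial batch."""
    return -(-num_examples // batch_size)


def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Linear decay from base_lr to 0 (assumption: no warmup phase)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


cfg = FineTuneConfig()
# Assumption: all 2000 generated reasoning steps are training examples.
total_steps = steps_per_epoch(2000, cfg.batch_size) * cfg.num_epochs
halfway_lr = linear_lr(total_steps // 2, total_steps, cfg.learning_rate)
```

Under these assumptions the run has 32 steps per epoch and 96 steps in total; a different split or a warmup phase would change both numbers.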