When Do Program-of-Thought Works for Reasoning?

Authors: Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, Huajun Chen

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through an empirical analysis, we find that not all code data, at every level of complexity, can be learned or understood by LLMs; an optimal level of complexity is critical to improving reasoning abilities through program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and to code-data filtering for code-generation tasks. Extensive results demonstrate the effectiveness of our proposed approach.
Researcher Affiliation | Collaboration | Zhen Bi (1,2), Ningyu Zhang (1,2)*, Yinuo Jiang (1,2), Shumin Deng (4), Guozhou Zheng (1,2,3), Huajun Chen (1,2,3)*; 1 Zhejiang University; 2 Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph; 3 Donghai Laboratory; 4 NUS-NCS Joint Lab, National University of Singapore
Pseudocode | Yes | Algorithm 1: Auto-Synthesizing and Stratifying (see the stratification sketch after this table).
Open Source Code | No | The paper contains no explicit statement or direct link to source code for the methodology or tools developed by the authors. It cites Code Alpaca with a GitHub link in the bibliography, but that is a third-party resource used as a dataset, not the authors' own implementation.
Open Datasets | Yes | The sources of seed data include the training sets of GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), ASDiv (Miao, Liang, and Su 2020), SVAMP (Patel, Bhattamishra, and Goyal 2021), and AQuA (Ling et al. 2017). For each dataset, we generate approximately 10,000 samples.
Dataset Splits | Yes | We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparison. For all generated data, we randomly sampled 10% and verified its correctness through manual checks and automated validation with GPT-4, ensuring accuracy within a reasonable margin of error.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used to run the experiments.
Software Dependencies | No | The paper mentions using Python for code examples and ChatGPT/GPT-4 for data generation and validation, but it does not list version numbers for any programming languages, libraries, or other software dependencies needed to reproduce the experimental environment.
Experiment Setup | Yes | For the few-shot setting, we choose 3-shot evaluation. We train three models based on LLaMA (version 1.0), ranging from 7 billion to 65 billion parameters. We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparison. We choose gpt-3.5-turbo as the main benchmark model and accuracy (Acc) as the main evaluation metric. The full experimental setup is given in the supplementary material. (A sketch of the 3-shot evaluation loop follows this table.)
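
Since Algorithm 1 (Auto-Synthesizing and Stratifying) is given only as pseudocode and no reference code is released, the following is a minimal Python sketch of the stratification and per-stratum selection steps. It assumes complexity is approximated by counting branch, loop, and call nodes in each program's AST and that every generated sample is a dict with a "program" field; the paper's actual complexity measure, synthesis prompts, and data format may differ.

```python
import ast
import random

def complexity_score(code: str) -> int:
    """Rough proxy for program complexity: count branching, looping, and call
    nodes in the AST. (Illustrative only; the paper's measure may differ.)"""
    tree = ast.parse(code)
    counted = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.Call, ast.comprehension)
    return sum(isinstance(node, counted) for node in ast.walk(tree))

def stratify(samples, n_per_stratum=1700, seed=0):
    """Split synthesized program-of-thought samples into low/medium/high
    complexity strata and draw an equal-sized pool from each stratum,
    mirroring the 1,700-instance selection described in the paper."""
    scored = sorted(samples, key=lambda s: complexity_score(s["program"]))
    third = len(scored) // 3
    strata = {
        "low": scored[:third],
        "medium": scored[third:2 * third],
        "high": scored[2 * third:],
    }
    rng = random.Random(seed)
    return {
        name: rng.sample(pool, min(n_per_stratum, len(pool)))
        for name, pool in strata.items()
    }
```

A 10% verification pass (manual checks plus GPT-4 validation, as reported in the Dataset Splits row) would be applied to the synthesized samples before stratification.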
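
The evaluation described in the Experiment Setup row (3-shot prompting, execution of the generated program, accuracy against gold answers) can be sketched as below. The prompt format, the `ans` variable convention, the dataset field names, and the `generate` callable standing in for the LLM call (e.g., gpt-3.5-turbo) are illustrative assumptions, not the paper's exact protocol.

```python
def build_prompt(demos, question):
    """Assemble a 3-shot program-of-thought prompt: each demonstration pairs a
    question with a short Python program whose final line assigns the answer
    to a variable named `ans` (a convention assumed here)."""
    parts = [f"Question: {d['question']}\n# Python solution\n{d['program']}\n" for d in demos]
    parts.append(f"Question: {question}\n# Python solution\n")
    return "\n".join(parts)

def execute_program(program: str):
    """Run the generated program and return the value bound to `ans`, if any."""
    scope = {}
    try:
        exec(program, scope)  # untrusted model output: sandbox this in practice
    except Exception:
        return None
    return scope.get("ans")

def accuracy(dataset, demos, generate):
    """Compute exact-match accuracy. `generate` is a placeholder for the LLM
    call (e.g., gpt-3.5-turbo): it takes a prompt string and returns the
    model's completion, expected to be a Python program."""
    correct = 0
    for item in dataset:
        program = generate(build_prompt(demos, item["question"]))
        pred = execute_program(program)
        try:
            if pred is not None and abs(float(pred) - float(item["answer"])) < 1e-4:
                correct += 1
        except (TypeError, ValueError):
            pass  # non-numeric prediction counts as incorrect
    return correct / len(dataset)
```

The numeric tolerance and the decision to count execution failures as errors are design choices of this sketch, not reported details of the paper.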