When Do Program-of-Thought Works for Reasoning?

Authors: Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, Huajun Chen

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through an empirical analysis, we find that not all code data, at every level of complexity, can be learned or understood by LLMs; an optimal level of complexity is critical to improving reasoning abilities through program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and to code-data filtering for code-generation tasks. Extensive results demonstrate the effectiveness of our proposed approach.
Researcher Affiliation | Collaboration | Zhen Bi (1,2), Ningyu Zhang (1,2)*, Yinuo Jiang (1,2), Shumin Deng (4), Guozhou Zheng (1,2,3), Huajun Chen (1,2,3)*; 1 Zhejiang University; 2 Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph; 3 Donghai Laboratory; 4 NUS-NCS Joint Lab, National University of Singapore
Pseudocode | Yes | Algorithm 1: Auto-Synthesizing and Stratifying (see the stratification sketch after this table).
Open Source Code | No | The paper contains no explicit statement or direct link to source code for the methodology or tools developed by the authors. It cites Code Alpaca with a GitHub link in the bibliography, but that is a third-party resource used as a dataset, not the authors' own implementation.
Open Datasets | Yes | The sources of seed data include the training sets of GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), ASDiv (Miao, Liang, and Su 2020), SVAMP (Patel, Bhattamishra, and Goyal 2021), and AQuA (Ling et al. 2017). For each dataset, we generate approximately 10,000 samples.
Dataset Splits | Yes | We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparison. For all generated data, we randomly sampled 10% and verified its correctness through manual checks and automated validation with GPT-4, ensuring accuracy within a reasonable margin of error.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used to run the experiments.
Software Dependencies | No | The paper mentions using Python for code examples and ChatGPT/GPT-4 for data generation and validation, but it does not list version numbers for any programming languages, libraries, or other software dependencies needed to reproduce the experimental environment.
Experiment Setup | Yes | For the few-shot setting, we choose 3-shot evaluation. We train three models based on LLaMA (version 1.0), ranging from 7 billion to 65 billion parameters. We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparison. We choose gpt-3.5-turbo as the main benchmark model and accuracy (Acc) as the main evaluation metric. The full experimental setup is given in the supplementary material. (A sketch of the 3-shot evaluation loop follows this table.)
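
Since Algorithm 1 (Auto-Synthesizing and Stratifying) is given only as pseudocode and no reference code is released, the following is a minimal Python sketch of the stratification and per-stratum selection steps. It assumes complexity is approximated by counting branch, loop, and call nodes in each program's AST and that every generated sample is a dict with a "program" field; the paper's actual complexity measure, synthesis prompts, and data format may differ.

```python
import ast
import random

def complexity_score(code: str) -> int:
    """Rough proxy for program complexity: count branching, looping, and call
    nodes in the AST. (Illustrative only; the paper's measure may differ.)"""
    tree = ast.parse(code)
    counted = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.Call, ast.comprehension)
    return sum(isinstance(node, counted) for node in ast.walk(tree))

def stratify(samples, n_per_stratum=1700, seed=0):
    """Split synthesized program-of-thought samples into low/medium/high
    complexity strata and draw an equal-sized pool from each stratum,
    mirroring the 1,700-instance selection described in the paper."""
    scored = sorted(samples, key=lambda s: complexity_score(s["program"]))
    third = len(scored) // 3
    strata = {
        "low": scored[:third],
        "medium": scored[third:2 * third],
        "high": scored[2 * third:],
    }
    rng = random.Random(seed)
    return {
        name: rng.sample(pool, min(n_per_stratum, len(pool)))
        for name, pool in strata.items()
    }
```

A 10% verification pass (manual checks plus GPT-4 validation, as reported in the Dataset Splits row) would be applied to the synthesized samples before stratification.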
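
The evaluation described in the Experiment Setup row (3-shot prompting, execution of the generated program, accuracy against gold answers) can be sketched as below. The prompt format, the `ans` variable convention, the dataset field names, and the `generate` callable standing in for the LLM call (e.g., gpt-3.5-turbo) are illustrative assumptions, not the paper's exact protocol.

```python
def build_prompt(demos, question):
    """Assemble a 3-shot program-of-thought prompt: each demonstration pairs a
    question with a short Python program whose final line assigns the answer
    to a variable named `ans` (a convention assumed here)."""
    parts = [f"Question: {d['question']}\n# Python solution\n{d['program']}\n" for d in demos]
    parts.append(f"Question: {question}\n# Python solution\n")
    return "\n".join(parts)

def execute_program(program: str):
    """Run the generated program and return the value bound to `ans`, if any."""
    scope = {}
    try:
        exec(program, scope)  # untrusted model output: sandbox this in practice
    except Exception:
        return None
    return scope.get("ans")

def accuracy(dataset, demos, generate):
    """Compute exact-match accuracy. `generate` is a placeholder for the LLM
    call (e.g., gpt-3.5-turbo): it takes a prompt string and returns the
    model's completion, expected to be a Python program."""
    correct = 0
    for item in dataset:
        program = generate(build_prompt(demos, item["question"]))
        pred = execute_program(program)
        try:
            if pred is not None and abs(float(pred) - float(item["answer"])) < 1e-4:
                correct += 1
        except (TypeError, ValueError):
            pass  # non-numeric prediction counts as incorrect
    return correct / len(dataset)
```

The numeric tolerance and the decision to count execution failures as errors are design choices of this sketch, not reported details of the paper.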