LLMs Can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought

Authors: Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines.
Researcher Affiliation | Collaboration | (1) Shanghai Business School, Shanghai, China; (2) Learnable.AI Inc., Shanghai, China; (3) Centre for Frontier AI Research, A*STAR, Singapore; (4) Institute of High-Performance Computing, A*STAR, Singapore; (5) The Hong Kong Polytechnic University, Hong Kong, China; (6) Microsoft Research Asia, Shanghai, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Accessible at github.com/HaoyuanPeng/PedCoT-IJCAI24/
Open Datasets | Yes | We collect two public datasets containing step-level correctness labels for mathematical problems of different difficulties: BIG-Bench Mistake [Tyen et al., 2023] and PRM800K [Lightman et al., 2023].
Dataset Splits | No | The paper describes the datasets and how they were selected for the experiments, but does not explicitly specify the training/validation/test splits (e.g., percentages or counts per split) needed for reproducibility.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper does not provide version numbers for ancillary software dependencies (e.g., libraries, frameworks, or programming languages) beyond the general mention of the LLMs used.
Experiment Setup | Yes | The temperature for generation is consistently set to 0 for both models to minimize the diversity of model outputs.
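The Experiment Setup row quotes the one concrete detail the paper reports: generation temperature fixed at 0 for both evaluated models. Below is a minimal sketch of what that zero-shot, step-level mistake check could look like, assuming the OpenAI Python client (openai >= 1.0); the model name, prompt wording, and the judge_step helper are illustrative assumptions and do not reproduce the paper's PedCoT prompt.

```python
# Minimal sketch: ask an LLM, at temperature 0, whether a single solution step
# is correct. Prompt text, model name, and helper name are illustrative
# assumptions, not the paper's PedCoT prompting strategy.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_step(problem: str, prior_steps: list[str], step: str,
               model: str = "gpt-3.5-turbo") -> str:
    """Return the model's verdict on whether `step` is correct,
    given the problem statement and the steps that precede it."""
    prompt = (
        f"Problem: {problem}\n"
        "Previous steps:\n" + "\n".join(prior_steps) + "\n"
        f"Candidate step: {step}\n"
        "Is the candidate step mathematically correct? "
        "Answer 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # matches the reported setting: temperature fixed at 0
    )
    return response.choices[0].message.content.strip()
```

For datasets with per-step correctness labels such as PRM800K or BIG-Bench Mistake, one would loop a function like this over each annotated step and compare the verdicts against the gold labels; the exact record formats of those datasets are not reproduced here.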