LLMs Can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought
Authors: Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. |
| Researcher Affiliation | Collaboration | ¹Shanghai Business School, Shanghai, China; ²Learnable.AI Inc., Shanghai, China; ³Centre for Frontier AI Research, A*STAR, Singapore; ⁴Institute of High-Performance Computing, A*STAR, Singapore; ⁵The Hong Kong Polytechnic University, Hong Kong, China; ⁶Microsoft Research Asia, Shanghai, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Accessible at github.com/HaoyuanPeng/PedCoT-IJCAI24/ |
| Open Datasets | Yes | We collect two public datasets containing step-level correctness labels for mathematical problems with different difficulties: BIG-Bench Mistake [Tyen et al., 2023] and PRM800K [Lightman et al., 2023]. |
| Dataset Splits | No | The paper describes the datasets and their selection for experiments, but does not explicitly provide details about specific training/validation/test splits (e.g., percentages or counts for each split) used for reproducibility. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies (e.g., libraries, frameworks, or programming languages beyond the general mention of LLMs). |
| Experiment Setup | Yes | The temperature for generation is consistently set to 0 for both models to minimize the diversity of model outputs. (Illustrated in the sketch after the table.) |
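
The setup row above is the paper's only concrete decoding detail. Below is a minimal sketch of what that setting looks like in practice, assuming the OpenAI chat completions client; the model name and the zero-shot mistake-finding prompt are illustrative placeholders, not the paper's PedCoT prompt or its exact model configuration.

```python
# Minimal sketch of deterministic generation for step-level mistake finding.
# Assumptions: openai>=1.0 client, OPENAI_API_KEY in the environment, and an
# illustrative model name; only temperature=0 is taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def find_first_mistake(problem: str, steps: list[str]) -> str:
    """Ask the model to locate the first incorrect reasoning step, if any."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    response = client.chat.completions.create(
        model="gpt-4",        # placeholder; the paper evaluates two LLMs
        temperature=0,        # set to 0, per the paper's experiment setup
        messages=[
            {
                "role": "user",
                "content": (
                    f"Problem: {problem}\n{numbered}\n"
                    "Identify the first incorrect step, "
                    "or answer 'all steps are correct'."
                ),
            }
        ],
    )
    return response.choices[0].message.content
```

Setting temperature to 0 makes decoding (near-)greedy, so repeated runs on the same input produce essentially identical outputs, which is what the authors mean by minimizing the diversity of model outputs.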