Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LLMs Can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought
Authors: Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li
IJCAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. |
| Researcher Affiliation | Collaboration | 1 Shanghai Business School, Shanghai, China; 2 Learnable.AI Inc., Shanghai, China; 3 Centre for Frontier AI Research, A*STAR, Singapore; 4 Institute of High-Performance Computing, A*STAR, Singapore; 5 The Hong Kong Polytechnic University, Hong Kong, China; 6 Microsoft Research Asia, Shanghai, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Accessible on github.com/HaoyuanPeng/PedCoT-IJCAI24/ |
| Open Datasets | Yes | We collect two public datasets containing step-level correctness labels for mathematical problems with different difficulties. BIG-Bench Mistake [Tyen et al., 2023]; PRM800K [Lightman et al., 2023]. |
| Dataset Splits | No | The paper describes the datasets and their selection for experiments, but does not explicitly provide details about specific training/validation/test splits (e.g., percentages or counts for each split) used for reproducibility. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies (e.g., libraries, frameworks, or programming languages beyond the general mention of LLMs). |
| Experiment Setup | Yes | The temperature for generation is consistently set to 0 for both models to minimize the diversity of model outputs. |