SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Authors: Ning Miao, Yee Whye Teh, Tom Rainforth

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We test SelfCheck on math- and logic-based datasets and find that it successfully recognizes errors and, in turn, increases final answer accuracies." |
| Researcher Affiliation | Academia | Ning Miao, Yee Whye Teh, Tom Rainforth; Department of Statistics, University of Oxford. |
| Pseudocode | No | The paper describes the stages of SelfCheck but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | GSM8K (Cobbe et al., 2021), MathQA (Amini et al., 2019), and MATH (Hendrycks et al., 2021). |
| Dataset Splits | No | The paper mentions using test sets and a subset of the MATH test set, but does not give specific training/validation/test splits (e.g., percentages or exact counts per split). |
| Hardware Specification | No | "Restricted by the speed of Llama2 (70B, 4-bit) on our server (which is only 20 tokens/s)." This indicates the use of a server and a specific Llama2 model but gives no specific hardware components (GPU or CPU models, memory, etc.). |
| Software Dependencies | Yes | "We use GPT-3.5 (gpt-3.5-0301) and GPT-4 (gpt-4-0613) as our LLMs, focusing in particular on the former due to budget restrictions. An additional experiment using Llama2 (70B, 4-bit, Touvron et al. (2023)) is provided in Appendix E." |
| Experiment Setup | Yes | "We also fix λ1 = 1 and λ0 = 0.3 throughout our experiments. Because of the high cost of calling the GPT-4 API, we randomly sample 500 questions from each dataset to form the test sets and generate 2 (instead of 10) answers to each question." |
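To make the quoted hyperparameters concrete: in SelfCheck, λ1 and λ0 penalize reasoning steps whose checks come back as contradicted or inconclusive, and the resulting per-answer confidence weights a vote over the sampled answers. The sketch below is an illustrative reading of that integration scheme, not the authors' code; the sigmoid form, the step-result encoding (-1 / 0 / 1), and the function names are assumptions.

```python
import math
from collections import defaultdict

def confidence_score(check_results, lam1=1.0, lam0=0.3):
    """Map per-step check results to an answer-level confidence in (0, 1].

    check_results: one entry per reasoning step (assumed encoding):
      -1 = checker contradicts the step, 0 = checker is unsure, 1 = supported.
    lam1 / lam0 penalize contradicted and inconclusive steps respectively
    (the paper fixes lambda_1 = 1 and lambda_0 = 0.3).
    """
    n_fail = sum(1 for r in check_results if r == -1)
    n_unsure = sum(1 for r in check_results if r == 0)
    # Sigmoid of the negated penalty, scaled by 2 so a fully clean
    # chain of checks yields weight 1.0.
    return 2.0 / (1.0 + math.exp(lam1 * n_fail + lam0 * n_unsure))

def weighted_vote(answers_with_checks):
    """Return the answer whose confidence-weighted votes sum highest."""
    scores = defaultdict(float)
    for answer, checks in answers_with_checks:
        scores[answer] += confidence_score(checks)
    return max(scores, key=scores.get)

samples = [
    ("42", [1, 1, 1]),   # all steps supported -> weight 1.0
    ("17", [1, -1, 0]),  # one contradicted, one unsure step -> weight ~0.43
]
print(weighted_vote(samples))  # -> 42
```

With only 2 sampled answers per question (as in the quoted setup), this weighting lets a clean reasoning chain outvote a flagged one even without a majority.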