SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Authors: Ning Miao, Yee Whye Teh, Tom Rainforth

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We test SelfCheck on math- and logic-based datasets and find that it successfully recognizes errors and, in turn, increases final answer accuracies." |
| Researcher Affiliation | Academia | Ning Miao, Yee Whye Teh, Tom Rainforth; Department of Statistics, University of Oxford. |
| Pseudocode | No | The paper describes the stages of SelfCheck but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | GSM8K (Cobbe et al., 2021), MathQA (Amini et al., 2019), and MATH (Hendrycks et al., 2021). |
| Dataset Splits | No | The paper mentions using test sets and a subset of the MATH test set, but does not give specific training/validation/test splits (e.g., percentages or exact counts per split). |
| Hardware Specification | No | "Restricted by the speed of Llama2 (70B, 4-bit) on our server (which is only 20 tokens/s)." This indicates the use of a server and a specific Llama2 model but gives no specific hardware components (GPU or CPU models, memory, etc.). |
| Software Dependencies | Yes | "We use GPT-3.5 (gpt-3.5-0301) and GPT-4 (gpt-4-0613) as our LLMs, focusing in particular on the former due to budget restrictions. An additional experiment using Llama2 (70B, 4-bit, Touvron et al. (2023)) is provided in Appendix E." |
| Experiment Setup | Yes | "We also fix λ1 = 1 and λ0 = 0.3 throughout our experiments. Because of the high cost of calling the GPT-4 API, we randomly sample 500 questions from each dataset to form the test sets and generate 2 (instead of 10) answers to each question." |
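To make the quoted hyperparameters concrete: in SelfCheck, λ1 and λ0 penalize reasoning steps whose checks come back as contradicted or inconclusive, and the resulting per-answer confidence weights a vote over the sampled answers. The sketch below is an illustrative reading of that integration scheme, not the authors' code; the sigmoid form, the step-result encoding (-1 / 0 / 1), and the function names are assumptions.

```python
import math
from collections import defaultdict

def confidence_score(check_results, lam1=1.0, lam0=0.3):
    """Map per-step check results to an answer-level confidence in (0, 1].

    check_results: one entry per reasoning step (assumed encoding):
      -1 = checker contradicts the step, 0 = checker is unsure, 1 = supported.
    lam1 / lam0 penalize contradicted and inconclusive steps respectively
    (the paper fixes lambda_1 = 1 and lambda_0 = 0.3).
    """
    n_fail = sum(1 for r in check_results if r == -1)
    n_unsure = sum(1 for r in check_results if r == 0)
    # Sigmoid of the negated penalty, scaled by 2 so a fully clean
    # chain of checks yields weight 1.0.
    return 2.0 / (1.0 + math.exp(lam1 * n_fail + lam0 * n_unsure))

def weighted_vote(answers_with_checks):
    """Return the answer whose confidence-weighted votes sum highest."""
    scores = defaultdict(float)
    for answer, checks in answers_with_checks:
        scores[answer] += confidence_score(checks)
    return max(scores, key=scores.get)

samples = [
    ("42", [1, 1, 1]),   # all steps supported -> weight 1.0
    ("17", [1, -1, 0]),  # one contradicted, one unsure step -> weight ~0.43
]
print(weighted_vote(samples))  # -> 42
```

With only 2 sampled answers per question (as in the quoted setup), this weighting lets a clean reasoning chain outvote a flagged one even without a majority.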