SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning
Authors: Ning Miao, Yee Whye Teh, Tom Rainforth
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test SelfCheck on math and logic-based datasets and find that it successfully recognizes errors and, in turn, increases final answer accuracies. |
| Researcher Affiliation | Academia | Ning Miao¹*, Yee Whye Teh¹, Tom Rainforth¹ (¹Department of Statistics, University of Oxford) |
| Pseudocode | No | The paper describes the stages of SelfCheck but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | GSM8K (Cobbe et al., 2021), MathQA (Amini et al., 2019), and MATH (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper mentions using test sets and a subset of the MATH test set, but does not provide specific details on training, validation, and test splits (e.g., percentages or exact counts for each split). |
| Hardware Specification | No | "Restricted by the speed of Llama2 (70B, 4-bit) on our server (which is only 20 tokens/s)". This indicates the use of a server and a specific Llama2 model but lacks specific hardware details (GPU or CPU models, memory, etc.). |
| Software Dependencies | Yes | We use GPT-3.5 (gpt-3.5-0301) and GPT-4 (gpt-4-0613) as our LLMs, focusing in particular on the former due to budget restrictions. An additional experiment using Llama2 (70B, 4-bit, Touvron et al. (2023)) is provided in Appendix E. |
| Experiment Setup | Yes | We also fix λ₁ = 1 and λ₀ = 0.3 throughout our experiments. Because of the high cost of calling the GPT-4 API, we randomly sample 500 questions from each dataset to form the test sets and generate 2 (instead of 10) answers to each question. |
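
The fixed weights λ₁ = 1 and λ₀ = 0.3 quoted in the Experiment Setup row refer to how SelfCheck integrates per-step check results into an answer-level confidence score, which is then used to vote among the sampled answers. Below is a minimal Python sketch of that pipeline. The sigmoid-style integration formula, the function names, and the role assignments of λ₁ (contradicted steps) and λ₀ (neutral steps) are illustrative assumptions, not the paper's exact definitions; consult the paper for the actual integration function.

```python
import math
from collections import defaultdict

# Hypothetical weights, matching the fixed values quoted from the paper.
# Assumed roles: each contradicted step is penalized by LAMBDA_1, each
# neutral / unverifiable step by LAMBDA_0.
LAMBDA_1 = 1.0
LAMBDA_0 = 0.3

def confidence(step_results: list[int]) -> float:
    """Map step-level check results (-1 contradiction, 0 neutral,
    1 support) to a confidence score in (0, 1]."""
    n_contra = sum(1 for r in step_results if r == -1)
    n_neutral = sum(1 for r in step_results if r == 0)
    # Sigmoid-style squashing (assumed form): a fully supported answer
    # scores 1.0; each flagged step pushes the score toward 0.
    return 2.0 / (1.0 + math.exp(LAMBDA_1 * n_contra + LAMBDA_0 * n_neutral))

def weighted_vote(answers: list[str], confidences: list[float]) -> str:
    """Pick the final answer by confidence-weighted voting over samples."""
    scores: dict[str, float] = defaultdict(float)
    for ans, w in zip(answers, confidences):
        scores[ans] += w
    return max(scores, key=scores.get)

# Example with two sampled answers per question, mirroring the GPT-4
# budget setting described above.
answers = ["42", "41"]
checks = [[1, 1, 1], [1, 0, -1]]  # step check results per sampled solution
final = weighted_vote(answers, [confidence(c) for c in checks])
print(final)  # "42": the fully supported solution outweighs the flagged one
```

Note that with only 2 sampled answers per question, as in the GPT-4 experiments, ties between distinct answers are possible, so a full implementation would also need a tie-breaking rule.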