miniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report baseline results using GPT-f (Polu & Sutskever, 2020), a neural theorem prover based on GPT-3 (Brown et al., 2020) and provide an analysis of its performance. In this section, in order to study baseline performances associated with existing systems, we report pass rates achieved by GPT-f (Polu & Sutskever, 2020) applied to Metamath, GPT-f/PACT (Polu & Sutskever, 2020; Han et al., 2021) applied to Lean, as well as a baseline prover implemented in Lean denoted as the tidy baseline.
Researcher Affiliation | Collaboration | Kunhao Zheng (École Polytechnique, kunhao.zheng@polytechnique.edu); Jesse Michael Han (OpenAI, University of Pittsburgh, jessemichaelhan@openai.com); Stanislas Polu (OpenAI, spolu@openai.com)
Pseudocode | No | The paper describes algorithms in text but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/openai/miniF2F/tree/v1
Open Datasets | Yes | miniF2F is a dataset of manually formalized statements of Olympiad-type problems, aligned in Lean, Metamath, and Isabelle (partial at the time of writing), providing a cross-platform benchmark for formal mathematical reasoning. miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). Table 1 shows a detailed distribution of these statements. The current version of the benchmark is v1 and results in this paper are reported using this version. v1 comprises 244 test and 244 valid statements. (https://github.com/openai/miniF2F/tree/v1)
Dataset Splits | Yes | miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). v1 comprises 244 test and 244 valid statements.
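The split described above can be illustrated with a minimal sketch. The paper does not specify the exact splitting procedure, so the function below is a hypothetical reconstruction: it groups statements by a stratification key (problem type and difficulty), shuffles each group, and divides it evenly between the validation and test halves.

```python
import random
from collections import defaultdict

def stratified_split(statements, key, seed=0):
    """Hypothetical sketch of an even, stratified random split:
    each (type, difficulty) stratum is shuffled and halved so both
    output sets cover every stratum equally."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in statements:
        buckets[key(s)].append(s)
    valid, test = [], []
    for group in buckets.values():
        rng.shuffle(group)
        half = len(group) // 2
        valid.extend(group[:half])
        test.extend(group[half:])
    return valid, test

# Toy example: 8 problems across 2 types x 2 difficulty levels.
problems = [{"id": i, "type": t, "difficulty": d}
            for i, (t, d) in enumerate(
                [(t, d) for t in ("algebra", "number_theory")
                        for d in ("easy", "hard")] * 2)]
valid, test = stratified_split(
    problems, key=lambda s: (s["type"], s["difficulty"]))
print(len(valid), len(test))  # 4 4
```

With the real benchmark's 488 formalized statements, the same procedure yields the reported 244 valid / 244 test statements.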
Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments, such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions software like Metamath, Lean, GPT-f, and various tactics, but it does not specify concrete version numbers for these software dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | All tidy baseline experiments are run with ωmax = 128, dmax = 8 using L + HL. GPT-f attempts e = 16 tactics per expansion while the tidy baseline attempts e = 17 tactics per expansion (L + HL, see section 4.2.1).
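The budget parameters quoted above (ωmax expansions, depth cap dmax, e tactics per expansion) can be made concrete with a minimal best-first search sketch. Everything here is a toy stand-in: the goal is an integer reduced to 0 and the "tactics" are arithmetic moves, not actual Lean tactic states; only the budget structure mirrors the paper's setup.

```python
import heapq
import itertools

def tidy_style_search(initial_goal, tactics, score,
                      omega_max=128, d_max=8, e=17):
    """Best-first search sketch under the paper's budgets: at most
    omega_max node expansions, depth capped at d_max, and up to e
    candidate tactics attempted per expansion."""
    counter = itertools.count()  # tie-breaker so the heap never compares goals
    frontier = [(score(initial_goal), next(counter), initial_goal, 0)]
    expansions = 0
    while frontier and expansions < omega_max:
        _, _, goal, depth = heapq.heappop(frontier)
        if goal == 0:            # toy "proof complete" condition
            return True, expansions
        if depth >= d_max:
            continue
        expansions += 1
        for tactic in tactics[:e]:   # attempt e tactics per expansion
            new_goal = tactic(goal)
            if new_goal is not None:  # tactic applied successfully
                heapq.heappush(frontier,
                               (score(new_goal), next(counter),
                                new_goal, depth + 1))
    return False, expansions

# Toy tactics: subtract 3, subtract 1, halve (only when even).
tactics = [lambda g: g - 3 if g >= 3 else None,
           lambda g: g - 1 if g >= 1 else None,
           lambda g: g // 2 if g % 2 == 0 else None]
solved, used = tidy_style_search(20, tactics, score=lambda g: g)
print(solved)  # True
```

The real tidy baseline scores nodes with a heuristic over Lean tactic states rather than an integer, but the interplay of the three budgets (ωmax, dmax, e) is the same.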