miniF2F: a cross-system benchmark for formal Olympiad-level mathematics
Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report baseline results using GPT-f (Polu & Sutskever, 2020), a neural theorem prover based on GPT-3 (Brown et al., 2020), and provide an analysis of its performance. In this section, in order to study baseline performances associated with existing systems, we report pass rates achieved by GPT-f (Polu & Sutskever, 2020) applied to Metamath, GPT-f/PACT (Polu & Sutskever, 2020; Han et al., 2021) applied to Lean, as well as a baseline prover implemented in Lean, denoted as the tidy baseline. |
| Researcher Affiliation | Collaboration | Kunhao Zheng, Ecole Polytechnique, kunhao.zheng@polytechnique.edu; Jesse Michael Han, OpenAI and University of Pittsburgh, jessemichaelhan@openai.com; Stanislas Polu, OpenAI, spolu@openai.com |
| Pseudocode | No | The paper describes algorithms in text but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/openai/miniF2F/tree/v1 |
| Open Datasets | Yes | miniF2F is a dataset of manually formalized statements of Olympiad-type problems, aligned in Lean, Metamath, and Isabelle (partial at the time of writing), providing a cross-platform benchmark for formal mathematical reasoning. miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). Table 1 shows a detailed distribution of these statements. The current version of the benchmark is v1 and results in this paper are reported using this version. v1 comprises 244 test and 244 valid statements. https://github.com/openai/miniF2F/tree/v1 |
| Dataset Splits | Yes | miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). v1 comprises 244 test and 244 valid statements. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions software like Metamath, Lean, GPT-f, and various tactics, but it does not specify concrete version numbers for these software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | All tidy baseline experiments are run with ω_max = 128 (expansion budget), d_max = 8 (depth limit), using L + HL. GPT-f attempts e = 16 tactics per expansion while the tidy baseline attempts e = 17 tactics per expansion (L + HL, see section 4.2.1). |
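The tidy baseline described in the setup row is a tree search bounded by an expansion budget (ω_max), a depth limit (d_max), and a fixed number of tactic attempts per expansion (e). The sketch below illustrates that control flow only; the toy integer "goal states" and `TACTICS` list are placeholders for Lean goals and the paper's curated tactic list, which the table does not reproduce.

```python
import heapq

# Hypothetical toy "prover": states are non-negative integers and a
# tactic either shrinks the state or fails (returns None). These stand
# in for Lean goal states and tactic applications.
TACTICS = [lambda s, k=k: s - k if s - k >= 0 else None for k in (1, 2, 3)]

def tidy_search(root, omega_max=128, d_max=8, tactics=TACTICS):
    """Bounded best-first search in the spirit of the tidy baseline:
    expand at most omega_max nodes, never go deeper than d_max, and at
    each expansion attempt every tactic in a fixed list."""
    # Priority queue ordered by a heuristic; here the state value itself
    # (smaller = assumed closer to the goal, an illustrative choice).
    frontier = [(root, root, 0)]  # (priority, state, depth)
    expansions = 0
    while frontier and expansions < omega_max:
        _, state, depth = heapq.heappop(frontier)
        if state == 0:            # toy success condition: goal reduced to 0
            return True
        if depth >= d_max:
            continue              # depth limit: do not expand this node
        expansions += 1           # counts against the omega_max budget
        for tac in tactics:       # e tactic attempts per expansion
            child = tac(state)
            if child is not None:
                heapq.heappush(frontier, (child, child, depth + 1))
    return False
```

With the paper's reported settings (ω_max = 128, d_max = 8), a proof is found only if some tactic sequence of length at most 8 closes the goal within 128 expansions; the `e = 17` figure corresponds to the length of the fixed tactic list, not a sampling parameter.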