miniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report baseline results using GPT-f (Polu & Sutskever, 2020), a neural theorem prover based on GPT-3 (Brown et al., 2020) and provide an analysis of its performance. In this section, in order to study baseline performances associated with existing systems, we report pass rates achieved by GPT-f (Polu & Sutskever, 2020) applied to Metamath, GPT-f/PACT (Polu & Sutskever, 2020; Han et al., 2021) applied to Lean, as well as a baseline prover implemented in Lean denoted as the tidy baseline.
Researcher Affiliation | Collaboration | Kunhao Zheng (École Polytechnique, kunhao.zheng@polytechnique.edu); Jesse Michael Han (OpenAI, University of Pittsburgh, jessemichaelhan@openai.com); Stanislas Polu (OpenAI, spolu@openai.com)
Pseudocode | No | The paper describes algorithms in text but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/openai/miniF2F/tree/v1
Open Datasets | Yes | miniF2F is a dataset of manually formalized statements of Olympiad-type problems, aligned in Lean, Metamath, and Isabelle (partial at the time of writing), providing a cross-platform benchmark for formal mathematical reasoning. miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). Table 1 shows a detailed distribution of these statements. The current version of the benchmark is v1 and results in this paper are reported using this version. v1 comprises 244 test and 244 valid statements. (https://github.com/openai/miniF2F/tree/v1)
Dataset Splits | Yes | miniF2F comprises a test set and a validation set, which are a stratified random split from the statements we formalized such that each set equally covers each problem type and difficulty (when available). v1 comprises 244 test and 244 valid statements.
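The split described above can be illustrated with a minimal sketch. The paper does not specify the exact splitting procedure, so the function below is a hypothetical reconstruction: it groups statements by a stratification key (problem type and difficulty), shuffles each group, and divides it evenly between the validation and test halves.

```python
import random
from collections import defaultdict

def stratified_split(statements, key, seed=0):
    """Hypothetical sketch of an even, stratified random split:
    each (type, difficulty) stratum is shuffled and halved so both
    output sets cover every stratum equally."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in statements:
        buckets[key(s)].append(s)
    valid, test = [], []
    for group in buckets.values():
        rng.shuffle(group)
        half = len(group) // 2
        valid.extend(group[:half])
        test.extend(group[half:])
    return valid, test

# Toy example: 8 problems across 2 types x 2 difficulty levels.
problems = [{"id": i, "type": t, "difficulty": d}
            for i, (t, d) in enumerate(
                [(t, d) for t in ("algebra", "number_theory")
                        for d in ("easy", "hard")] * 2)]
valid, test = stratified_split(
    problems, key=lambda s: (s["type"], s["difficulty"]))
print(len(valid), len(test))  # 4 4
```

With the real benchmark's 488 formalized statements, the same procedure yields the reported 244 valid / 244 test statements.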
Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments, such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions software like Metamath, Lean, GPT-f, and various tactics, but it does not specify concrete version numbers for these software dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | All tidy baseline experiments are run with ωmax = 128, dmax = 8 using L + HL. GPT-f attempts e = 16 tactics per expansion while the tidy baseline attempts e = 17 tactics per expansion (L + HL, see section 4.2.1).
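The budget parameters quoted above (ωmax expansions, depth cap dmax, e tactics per expansion) can be made concrete with a minimal best-first search sketch. Everything here is a toy stand-in: the goal is an integer reduced to 0 and the "tactics" are arithmetic moves, not actual Lean tactic states; only the budget structure mirrors the paper's setup.

```python
import heapq
import itertools

def tidy_style_search(initial_goal, tactics, score,
                      omega_max=128, d_max=8, e=17):
    """Best-first search sketch under the paper's budgets: at most
    omega_max node expansions, depth capped at d_max, and up to e
    candidate tactics attempted per expansion."""
    counter = itertools.count()  # tie-breaker so the heap never compares goals
    frontier = [(score(initial_goal), next(counter), initial_goal, 0)]
    expansions = 0
    while frontier and expansions < omega_max:
        _, _, goal, depth = heapq.heappop(frontier)
        if goal == 0:            # toy "proof complete" condition
            return True, expansions
        if depth >= d_max:
            continue
        expansions += 1
        for tactic in tactics[:e]:   # attempt e tactics per expansion
            new_goal = tactic(goal)
            if new_goal is not None:  # tactic applied successfully
                heapq.heappush(frontier,
                               (score(new_goal), next(counter),
                                new_goal, depth + 1))
    return False, expansions

# Toy tactics: subtract 3, subtract 1, halve (only when even).
tactics = [lambda g: g - 3 if g >= 3 else None,
           lambda g: g - 1 if g >= 1 else None,
           lambda g: g // 2 if g % 2 == 0 else None]
solved, used = tidy_style_search(20, tactics, score=lambda g: g)
print(solved)  # True
```

The real tidy baseline scores nodes with a heuristic over Lean tactic states rather than an integer, but the interplay of the three budgets (ωmax, dmax, e) is the same.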