Formal Mathematics Statement Curriculum Learning

Authors: Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that at same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we surpass previous state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
Researcher Affiliation | Collaboration | Stanislas Polu (OpenAI), Jesse Michael Han (Multi Technologies), Kunhao Zheng (École Polytechnique), Mantas Baksys (University of Cambridge), Igor Babuschkin (DeepMind), Ilya Sutskever (OpenAI)
Pseudocode | No | The paper describes algorithms (e.g., in Appendix F for the inequality generator and Section 4.4 for expert iteration) and includes an illustration of expert iteration in Appendix E (Figure 6), but it does not provide any formal pseudocode or algorithm blocks.
Open Source Code | Yes | To solve these issues we implemented lean-gym, a simple REPL interface over the standard input/output implemented in Lean directly. ... https://github.com/openai/lean-gym
Open Datasets | Yes | miniF2F benchmark: In this work, we target the miniF2F (Zheng et al., 2022) benchmark, which consists of 244 validation and 244 test formalized statements... We also extract mathlib-{train, valid, test}, the set of statements from mathlib along the split proposed in Han et al. (2022)... Using this generator we generate a curriculum of 5600 inequality statements... We denote this set of statements as synth-ineq... We refer to this dataset as synth-ineq-train... The 327 statements of miniF2F-curriculum are manually formalized from: AOPS Books (Lehoczky & Rusczyk, a;b)... MATH (Hendrycks et al., 2021) dataset: 25 problems. ... https://github.com/openai/miniF2F/tree/statement_curriculum_learning/lean/src/statement_curriculum_learning
Dataset Splits | Yes | We also extract mathlib-{train, valid, test}, the set of statements from mathlib along the split proposed in Han et al. (2022) (the validation and test splits of tactic, mix1, mix2 being aligned with mathlib-{valid, test} as the splits are determined by declaration name hashes (across all data sources including proof-term mining) as opposed to individual proof steps or data-points).
Hardware Specification | Yes | Indicatively, with our 774m parameter model, running a full expert iteration to train θ_full^9 required about 2000 A100 days of compute. Running one full proof search (a = 1, d = 512, e = 8), when properly parallelised, requires on average about 0.1 A100 hour of compute.
Software Dependencies | No | The paper mentions key software components like Lean, lean-gym (their own implementation), and GPT-3, along with datasets like mathlib and Common Crawl. However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We attempt a proof search for each statement s (s ∈ St) with d = 512 expansions and e = 8 samples per expansion.
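The expert-iteration loop summarized under Research Type (proof search interleaved with learning, progressively unlocking harder statements) can be illustrated with a toy sketch. Everything here is hypothetical: `ability` stands in for the prover model, difficulties are made-up numbers, and "training" is a simple score increment; it only demonstrates the curriculum effect, not the paper's system.

```python
def expert_iteration(ability, difficulties, rounds):
    """Toy model of expert iteration: proof search interleaved with learning.

    A statement is "proved" when its difficulty is within the current
    ability, and training on the newly found proofs raises the ability,
    unlocking harder statements. Purely illustrative.
    """
    solved = set()
    for _ in range(rounds):
        # Proof search phase: attempt every still-unsolved statement.
        newly = {s for s, d in difficulties.items()
                 if s not in solved and d <= ability}
        solved |= newly
        # Learning phase: each new proof makes the "model" a bit stronger.
        ability += 2 * len(newly)
    return ability, solved

statements = {"easy": 1, "medium": 3, "hard": 5}
final_ability, solved = expert_iteration(ability=1, difficulties=statements,
                                         rounds=3)
print(sorted(solved))  # → ['easy', 'hard', 'medium']
```

With proof search alone (no ability growth), only "easy" would ever be solved under the same budget, which mirrors the comparison the abstract draws.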
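The Open Source Code row describes lean-gym as a REPL over standard input/output. A minimal Python client for such a REPL might look like the sketch below; the launch command, the JSON-line wire format (a two-element array of command name and arguments), and the command names `init_search`/`run_tac` are assumptions taken from the lean-gym README and should be verified against the repository.

```python
import json
import subprocess

def encode_request(command, args):
    """Encode one lean-gym request as a JSON line (assumed wire format:
    a two-element array of command name and argument list)."""
    return json.dumps([command, args])

class LeanGym:
    """Minimal wrapper for a lean-gym REPL over stdin/stdout (a sketch;
    the launch command and response fields are assumptions)."""

    def __init__(self, cmd=("lean", "--run", "src/repl.lean")):
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def _send(self, command, args):
        self.proc.stdin.write(encode_request(command, args) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def init_search(self, decl_name, namespaces=""):
        # Start a proof search at the goal of a named declaration.
        return self._send("init_search", [decl_name, namespaces])

    def run_tac(self, search_id, state_id, tactic):
        # Apply one tactic to a previously recorded tactic state.
        return self._send("run_tac", [search_id, state_id, tactic])
```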
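The Dataset Splits row notes that splits are determined by declaration name hashes rather than by individual data points, so every proof step or mined proof term derived from the same declaration lands in the same split. A hash-based split with that property can be sketched as follows; the hash function and split percentages are illustrative, not the paper's actual choices.

```python
import hashlib

def split_of(decl_name, valid_pct=2, test_pct=2):
    """Assign a declaration to train/valid/test by hashing its name.

    Because the assignment depends only on the declaration name, it is
    stable across all data sources that reference that declaration,
    preventing leakage between splits. Percentages are illustrative.
    """
    bucket = int(hashlib.sha256(decl_name.encode()).hexdigest(), 16) % 100
    if bucket < valid_pct:
        return "valid"
    if bucket < valid_pct + test_pct:
        return "test"
    return "train"
```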
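To put the two compute figures in the Hardware Specification row side by side (a back-of-the-envelope conversion, not a number from the paper): 2,000 A100-days is 48,000 A100-hours, i.e. the equivalent of roughly 480,000 individual proof searches at 0.1 A100-hour each, keeping in mind that the 2,000-day figure covers the full expert iteration (training included), not search alone.

```python
# Back-of-the-envelope conversion of the quoted compute figures.
a100_days_total = 2000       # full expert iteration (training + search)
searches_per_hour = 10       # 0.1 A100-hour per search ⇒ 10 per A100-hour

a100_hours_total = a100_days_total * 24
equivalent_searches = a100_hours_total * searches_per_hour
print(a100_hours_total, equivalent_searches)  # → 48000 480000
```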
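The Experiment Setup row quotes the search budget: d = 512 expansions with e = 8 samples per expansion. A budgeted best-first search with those two knobs can be sketched as below; `expand(state, e)` is a hypothetical stand-in for sampling e candidate tactics from the model, and the cost/priority scheme is illustrative rather than the paper's.

```python
import heapq

def best_first_search(root, expand, d=512, e=8):
    """Sketch of a budgeted best-first proof search: at most d node
    expansions, each asking for e candidate steps. `expand(state, e)`
    returns (step_cost, child_state, is_proof) triples."""
    counter = 0                      # tiebreaker for equal-cost states
    frontier = [(0.0, counter, root)]
    for _ in range(d):               # expansion budget
        if not frontier:
            break
        cost, _, state = heapq.heappop(frontier)
        for step_cost, child, is_proof in expand(state, e):
            if is_proof:
                return child         # proof found within budget
            counter += 1
            heapq.heappush(frontier, (cost + step_cost, counter, child))
    return None                      # budget exhausted, search fails

# Toy goal: count up from 0 until the state equals 3.
toy_expand = lambda n, e: [(1.0, n + 1, n + 1 == 3)]
print(best_first_search(0, toy_expand))  # → 3
```

Returning `None` on budget exhaustion matches the report's framing: a statement simply goes unsolved in that iteration and can be retried in the next.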