Formal Mathematics Statement Curriculum Learning
Authors: Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we surpass previous state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads. |
| Researcher Affiliation | Collaboration | Stanislas Polu (OpenAI); Jesse Michael Han (Multi Technologies); Kunhao Zheng (École Polytechnique); Mantas Baksys (University of Cambridge); Igor Babuschkin (DeepMind); Ilya Sutskever (OpenAI) |
| Pseudocode | No | The paper describes algorithms (e.g., in Appendix F for the inequality generator and Section 4.4 for expert iteration), and includes an illustration of expert iteration in Appendix E (Figure 6), but it does not provide any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | To solve these issues we implemented lean-gym¹, a simple REPL interface over the standard input/output implemented in Lean directly. ... ¹https://github.com/openai/lean-gym |
| Open Datasets | Yes | miniF2F benchmark In this work, we target the miniF2F (Zheng et al., 2022) benchmark, which consists of 244 validation and 244 test formalized statements... We also extract mathlib-{train, valid, test}, the set of statements from mathlib along the split proposed in Han et al. (2022)... Using this generator we generate a curriculum of 5600 inequality statements... We denote this set of statements as synth-ineq... We refer to this dataset as synth-ineq-train... The 327 statements of miniF2F-curriculum³ are manually formalized from: AOPS Books (Lehoczky & Rusczyk, a;b)... MATH (Hendrycks et al., 2021) dataset: 25 problems. ... ³https://github.com/openai/miniF2F/tree/statement_curriculum_learning/lean/src/statement_curriculum_learning |
| Dataset Splits | Yes | We also extract mathlib-{train, valid, test}, the set of statements from mathlib along the split proposed in Han et al. (2022) (the validation and test splits of tactic, mix1, mix2 being aligned with mathlib-{valid, test}, as the splits are determined by declaration name hashes (across all data sources including proof-term mining) as opposed to individual proof steps or data-points). |
| Hardware Specification | Yes | Indicatively, with our 774M-parameter model, running a full expert iteration to train θ⁹_full required about 2000 A100 days of compute. Running one full proof search (a = 1, d = 512, e = 8), when properly parallelised, requires on average about 0.1 A100 hour of compute. |
| Software Dependencies | No | The paper mentions key software components like Lean, lean-gym (their own implementation), and GPT-3, along with datasets like Mathlib and Common Crawl. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We attempt a proof search for each statement s (s ∈ S_t) with d = 512 expansions and e = 8 samples per expansion. |
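The expert-iteration procedure quoted above (proof search interleaved with learning, applied to statements of varied difficulty) can be sketched as a toy loop. Everything below is illustrative: `proof_search`, `expert_iteration`, and the skill/difficulty scoring are hypothetical stand-ins, not the paper's language-model-based best-first search against lean-gym.

```python
def proof_search(statement, model, d=512, e=8):
    """Toy stand-in for best-first proof search: a statement counts as
    'solved' once the model's skill reaches its difficulty. d and e mirror
    the paper's budget (d expansions, e samples per expansion) but are
    unused in this simplified model."""
    return model["skill"] >= statement["difficulty"]

def expert_iteration(statements, model, iterations=9):
    """Interleave proof search and learning: each round, search every
    unsolved statement, then 'train' on the newly found proofs (here,
    bumping skill by the number of new solutions stands in for
    fine-tuning on the collected proofs)."""
    solved = set()
    for _ in range(iterations):
        new = [s["name"] for s in statements
               if s["name"] not in solved and proof_search(s, model)]
        solved.update(new)
        model["skill"] += len(new)  # stand-in for fine-tuning on new proofs
    return solved

# Statements of increasing difficulty: early rounds solve only the easy
# ones; each round of 'training' unlocks harder ones, illustrating the
# curriculum effect described in the abstract.
statements = [{"name": f"s{i}", "difficulty": i} for i in range(10)]
model = {"skill": 1}
result = expert_iteration(statements, model)
```

Run end to end, the loop solves the whole set even though the initial model only handles the easiest statements: the point of the toy is that no ground-truth proofs are needed, only a pool of statements spanning a range of difficulties.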