HyperTree Proof Search for Neural Theorem Proving

Authors: Guillaume Lample, Timothée Lacroix, Marie-Anne Lachaux, Aurélien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, Xavier Martinet

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We report detailed ablations of our pipeline's main components by studying performance on three environments of increasing complexity. In particular, we show that with HTPS alone, a model trained on annotated proofs manages to prove 65.4% of a held-out set of Metamath theorems, significantly outperforming the previous state of the art of 56.5% by GPT-f. Online training on these unproved theorems increases accuracy to 82.6%. With a similar computational budget, we improve the state of the art on the Lean-based miniF2F-curriculum dataset from 31% to 42% proving accuracy."
Researcher Affiliation | Collaboration | Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, Timothée Lacroix (equal contribution; corresponding authors: {glample,malachaux,tlacroix}@fb.com). Affiliations: Meta AI Research; Vrije Universiteit Amsterdam; CERMICS, École des Ponts ParisTech.
Pseudocode | Yes | "To do so, we recursively follow the arg-max of the search policy from the root, until we reach the leaves of the current hypergraph (the detailed pseudo-code can be found in Section C.4)." (A sketch of this selection step is given after the table.)
Open Source Code | No | "The code for the Equations environment will be open-sourced. We also plan to make our trained model publicly available to help people in the formal community. Code for the overall distributed training architecture is tied to our infrastructure and will be difficult to open-source."
Open Datasets | Yes | "We develop and test our methods on three theorem proving environments: Metamath, Lean and Equations. Metamath [25] comes with a database of 35k human-written theorems called set.mm. ... Lean comes with a human-written library of 27k theorems called Mathlib [26]. ... Finally, we add another supervised co-training task by converting to Lean a synthetic dataset of theorems generated by the Equations environment... miniF2F dataset [6]."
Dataset Splits | Yes | "The produced model is then evaluated on unseen statements, namely the validation and test splits of the miniF2F dataset [6]. In the transductive setup, we also report the cumulative pass rate, i.e. the proportion of theorems solved at least once during online training. ... During online training, we sample statements from the training and from the validation splits of set.mm equally." (A sketch of this sampling scheme is given after the table.)
Hardware Specification | Yes | "In Lean, we run our experiments on A100 GPUs with 32 trainers and 200 provers. ... On Metamath, we train our model on V100 GPUs, with 128 trainers and 256 provers, whereas ablations are run on 16 trainers and 32 provers."
Software Dependencies | No | The paper mentions using transformer-based models and specific frameworks (Lean, Metamath, and the Equations environment), but does not provide version numbers for software dependencies (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | "Our proof-search depends on many hyper-parameters, and the optimal settings might not be the same for all statements, making tuning impractical. Thus, the controller samples these hyper-parameters from pre-defined ranges (see Appendix D.3 for details) for each different proof-search attempt. ... We give more details on the hyper-parameters used in Appendix D.3." (A sketch of per-attempt hyper-parameter sampling is given after the table.)
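
The selection step quoted in the Pseudocode row can be summarized in a few lines. The sketch below is a hedged reconstruction, not the authors' code: the `Node` fields and function names are hypothetical, and the actual pseudo-code (including virtual counts and value backups) is in Section C.4 of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One goal in the proof hypergraph (hypothetical structure for illustration)."""
    goal: str
    tactics: dict = field(default_factory=dict)  # tactic -> list of subgoal Nodes
    policy: dict = field(default_factory=dict)   # tactic -> search-policy score

def select_leaves(node: Node, leaves: list) -> None:
    """Follow the arg-max of the search policy from `node` down to the
    unexpanded leaves of the current hypergraph, collecting them in `leaves`."""
    if not node.tactics:
        # Unexpanded node: this is a leaf the prover will expand next.
        leaves.append(node)
        return
    # Descend through the tactic with the highest search-policy score.
    best_tactic = max(node.policy, key=node.policy.get)
    # A hyperedge can produce several subgoals, all of which must be proved,
    # so the descent recurses into every child of the selected tactic.
    for child in node.tactics[best_tactic]:
        select_leaves(child, leaves)
```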
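The Dataset Splits row notes that online training draws statements from set.mm's training and validation splits equally. A minimal sketch of that sampling scheme, assuming the splits are plain lists of statements (names are illustrative):

```python
import random

def sample_statement(train_split: list, valid_split: list, rng=random) -> str:
    """Draw one statement for online training, picking the training and
    validation splits of set.mm with equal probability."""
    split = train_split if rng.random() < 0.5 else valid_split
    return rng.choice(split)
```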
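The Experiment Setup row describes how the controller re-samples proof-search hyper-parameters for every attempt rather than fixing one tuned configuration. Below is a minimal sketch under assumed knobs and ranges; the real parameter names and values are listed in Appendix D.3 and are not reproduced here.

```python
import random

# Illustrative knobs and ranges only; the actual ones are in Appendix D.3.
HPARAM_RANGES = {
    "num_expansions": (1_000, 10_000),  # search budget per proof attempt
    "temperature": (0.7, 1.5),          # decoding temperature for the policy model
    "exploration_c": (0.1, 10.0),       # exploration constant of the search policy
}

def sample_search_hparams(rng=random) -> dict:
    """Sample one hyper-parameter setting per proof-search attempt, as the
    controller does, instead of committing to a single tuned configuration."""
    return {
        "num_expansions": rng.randint(*HPARAM_RANGES["num_expansions"]),
        "temperature": rng.uniform(*HPARAM_RANGES["temperature"]),
        "exploration_c": rng.uniform(*HPARAM_RANGES["exploration_c"]),
    }
```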