Proof Artifact Co-Training for Theorem Proving with Language Models

Authors: Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W. Ayers, Stanislas Polu

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train large language models on these data and demonstrate that PACT significantly improves theorem proving success rate on a held-out suite of test theorems, from 32% to 48%.
Researcher Affiliation | Collaboration | Jesse Michael Han (University of Pittsburgh, OpenAI); Jason Rute (IBM Research); Yuhuai Wu (Google Research, Stanford University); Edward W. Ayers (Carnegie Mellon University); Stanislas Polu (OpenAI)
Pseudocode | No | The paper describes its best-first search algorithm in Section 3.3 in paragraph form rather than as structured pseudocode or an algorithm block. (A hedged sketch of such a search loop is given after this table.)
Open Source Code | Yes | The source code used to generate the Lean datasets and run the evaluation is open source and made available in the following repositories: Lean theorem proving environment: https://github.com/jesse-michael-han/lean-tpe-public; tactic step data pipeline: https://github.com/jasonrute/lean_proof_recording; PACT data pipeline: https://github.com/jesse-michael-han/lean-step-public
Open Datasets | No | LEANSTEP is the first and only tactic proof dataset for the Lean theorem prover. This makes available a large set of formal mathematical data to researchers covering a diverse and deep spectrum of pure mathematics. (The paper explains how the dataset is generated and provides code for generating it, but does not link to the processed dataset itself.)
Dataset Splits | Yes | We use an 80-5-15 train-validation-test split. We split all datapoints deterministically by theorem name, by hashing each name to a float in (0, 1). (A sketch of such a hash-based split is given after this table.)
Hardware Specification | Yes | Fine-tuning on our largest dataset mix1 + mix2 + tactic required 26 hours using 64 A100 GPUs exhibiting high FP16 usage, totalling an estimated 1.5K A100(FP16)-hours. Each evaluation required 10 hours with 30% GPU utilization. We observed that our evaluation was bottlenecked by inference and in practice, we hosted up to three evaluation loops at once on a VM with 80 logical cores without achieving full CPU utilization.
Software Dependencies | No | The provided code also supports querying a locally hosted Transformer from the open-source library fairseq via the Fairseq CLI (Ott et al., 2019). (The paper mentions fairseq but does not pin a version for it or for any other key software dependency.)
Experiment Setup | Yes | Unless mentioned otherwise, all of our models have 24 layers with d_model = 1536 and 24 heads, accruing to 837M trainable parameters... We use the standard BPE encoding (Brown et al., 2020), a batch size of 512 and a learning rate of 0.00025 with a cosine schedule and a 100-step ramp-up. (A sketch of this learning-rate schedule is given after this table.)
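
The best-first search that the paper describes only in prose (Section 3.3) ranks open goal states by the cumulative log-probability of the tactic sequence that produced them and repeatedly expands the most promising state with model-suggested tactics. The snippet below is a minimal sketch of that idea, not the authors' implementation: suggest_tactics, apply_tactic, and the is_solved/fingerprint methods on goal states are hypothetical placeholders for the language-model query and the Lean interaction layer.

import heapq

def best_first_search(root_state, suggest_tactics, apply_tactic,
                      max_expansions=512, samples_per_goal=16):
    # Priority queue ordered by negated cumulative log-probability, so the
    # most probable partial proof is expanded first. `counter` breaks ties
    # without comparing goal-state objects.
    counter = 0
    queue = [(0.0, counter, root_state, [])]
    seen = set()

    for _ in range(max_expansions):
        if not queue:
            return None  # search space exhausted
        neg_logprob, _, state, proof = heapq.heappop(queue)
        if state.is_solved():          # hypothetical: no goals remain
            return proof               # list of tactic strings closing the theorem
        key = state.fingerprint()      # hypothetical: stable key for the goal state
        if key in seen:
            continue
        seen.add(key)

        # suggest_tactics is a hypothetical wrapper around the language model:
        # it returns candidate (tactic_string, logprob) pairs for this goal.
        for tactic, logprob in suggest_tactics(state, samples_per_goal):
            try:
                # apply_tactic is a hypothetical wrapper around the Lean
                # environment; it raises if the tactic fails to run.
                child = apply_tactic(state, tactic)
            except Exception:
                continue
            counter += 1
            heapq.heappush(queue, (neg_logprob - logprob, counter,
                                   child, proof + [tactic]))
    return None  # expansion budget exhausted without a proof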
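
The deterministic 80-5-15 split by theorem name can be reproduced by hashing each name to a float in (0, 1), as the quoted passage describes. The sketch below fills in details the paper leaves open: the concrete hash function (MD5 here, chosen only for cross-run stability) and the exact 0.80/0.85 thresholds are illustrative, not taken from the authors' pipeline.

import hashlib

def split_of(theorem_name: str) -> str:
    # Hash the theorem name to a reproducible float in [0, 1). MD5 is used
    # only because it is stable across runs and machines; the paper does not
    # say which hash was used.
    digest = hashlib.md5(theorem_name.encode("utf-8")).hexdigest()
    x = int(digest, 16) / 16 ** len(digest)
    if x < 0.80:          # first 80% of the hash range -> training set
        return "train"
    if x < 0.85:          # next 5% -> validation set
        return "validation"
    return "test"         # remaining 15% -> test set

# All datapoints extracted from the same theorem share one name and therefore
# land in the same split, which is the property the split relies on.
print(split_of("nat.add_comm"))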
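
The quoted optimizer settings (peak learning rate 0.00025, cosine schedule, 100-step ramp-up) correspond to a standard warmup-plus-cosine-decay schedule. The sketch below is an illustration under assumptions: the paper does not give the total number of training steps or a final learning rate, so total_steps and the decay to zero are placeholders.

import math

def learning_rate(step: int, peak_lr: float = 2.5e-4,
                  warmup_steps: int = 100, total_steps: int = 100_000) -> float:
    # Linear ramp-up over the first `warmup_steps`, then cosine decay.
    # Only the peak of 0.00025 and the 100-step ramp-up come from the paper;
    # `total_steps` and the decay target of 0 are assumptions.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))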