Proof Artifact Co-Training for Theorem Proving with Language Models
Authors: Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, Stanislas Polu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train large language models on these data and demonstrate that PACT significantly improves theorem proving success rate on a held-out suite of test theorems, from 32% to 48%. |
| Researcher Affiliation | Collaboration | Jesse Michael Han (University of Pittsburgh, OpenAI); Jason Rute (IBM Research); Yuhuai Wu (Google Research, Stanford University); Edward W. Ayers (Carnegie Mellon University); Stanislas Polu (OpenAI) |
| Pseudocode | No | The paper describes the 'best-first search algorithm' in Section 3.3 but presents it in paragraph form rather than as a structured pseudocode or algorithm block. (A hedged sketch of such a search loop appears below the table.) |
| Open Source Code | Yes | The source code used to generate the Lean datasets and run the evaluation is open source and made available in the following repositories: Lean theorem proving environment : https://github.com/jesse-michael-han/lean-tpe-public Tactic step data pipeline : https://github.com/jasonrute/lean_proof_recording PACT data pipeline : https://github.com/jesse-michael-han/lean-step-public |
| Open Datasets | No | LEANSTEP is the first and only tactic proof dataset for the Lean theorem prover. This makes available a large set of formal mathematical data to researchers covering a diverse and deep spectrum of pure mathematics. (The paper explains how the dataset is generated and provides code for generating it, but not a direct link to the processed dataset itself.) |
| Dataset Splits | Yes | We use an 80-5-15 train-validation-test split. We split all datapoints deterministically by theorem name, by hashing each name to a float in (0, 1). (A sketch of this hash-based split appears below the table.) |
| Hardware Specification | Yes | Fine-tuning on our largest dataset mix1 + mix2 + tactic required 26 hours using 64 A100 GPUs exhibiting high FP16 usage, totalling an estimated 1.5K A100(FP16)-hours. Each evaluation required 10 hours with 30% GPU utilization. We observed that our evaluation was bottlenecked by inference and in practice, we hosted up to three evaluation loops at once on a VM with 80 logical cores without achieving full CPU utilization. |
| Software Dependencies | No | The provided code also supports querying a locally hosted Transformer from the open-source library fairseq via the Fairseq CLI (Ott et al., 2019). (The paper mentions 'fairseq' but does not provide a specific version number for it or any other key software dependency.) |
| Experiment Setup | Yes | Unless mentioned otherwise, all of our models have 24 layers with d_model = 1536 and 24 heads, accruing to 837M trainable parameters... We use the standard BPE encoding (Brown et al., 2020), a batch size of 512 and a learning rate of 0.00025 with a cosine schedule and a 100-step ramp-up. (A sketch of the warmup-plus-cosine schedule appears below the table.) |
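
The pseudocode row notes that the paper's best-first search (Section 3.3) is described only in prose. The sketch below is a generic reconstruction of such a loop, not the authors' implementation: `suggest_tactics`, `apply_tactic`, and `is_solved` are hypothetical callbacks standing in for the language-model query and the Lean proof environment, and the expansion budget and sample count are illustrative defaults.

```python
import heapq

def best_first_search(initial_state, suggest_tactics, apply_tactic, is_solved,
                      max_expansions=512, num_samples=16):
    """Generic best-first proof search: always expand the open tactic state
    whose partial proof has the highest cumulative log-probability."""
    counter = 0  # tie-breaker so heap entries never compare raw states
    queue = [(0.0, counter, initial_state, [])]  # (-cumulative logprob, id, state, tactics)
    for _ in range(max_expansions):
        if not queue:
            return None  # search space exhausted without a proof
        neg_score, _, state, proof = heapq.heappop(queue)
        # Ask the model for candidate tactics and their log-probabilities.
        for tactic, logprob in suggest_tactics(state, num_samples):
            result = apply_tactic(state, tactic)  # run the tactic in the prover
            if result is None:
                continue  # the tactic failed on this state
            if is_solved(result):
                return proof + [tactic]  # all goals closed: return the tactic script
            counter += 1
            heapq.heappush(queue, (neg_score - logprob, counter, result, proof + [tactic]))
    return None  # expansion budget exhausted
```

Because the heap key is the negated cumulative log-probability, higher-probability partial proofs are always expanded first.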
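The dataset-split row states that datapoints are assigned deterministically by hashing each theorem name to a float in (0, 1). The hash function itself is not specified in the quoted text, so the MD5-based mapping below is only an assumption; the 80-5-15 thresholds come from the paper.

```python
import hashlib

def split_of_theorem(name: str) -> str:
    """Deterministically map a theorem name to train/validation/test,
    so every datapoint derived from the same theorem lands in one split."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()  # hash choice is an assumption
    x = int(digest, 16) / 16 ** len(digest)  # hash mapped into [0, 1)
    if x < 0.80:
        return "train"       # 80%
    if x < 0.85:
        return "validation"  # 5%
    return "test"            # 15%

# The split depends only on the name, so it is stable across runs:
assert split_of_theorem("nat.add_comm") == split_of_theorem("nat.add_comm")
```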
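The experiment-setup row reports a base learning rate of 0.00025, a cosine schedule, and a 100-step ramp-up. The schedule below is a minimal sketch consistent with those three facts; the total step count and the decay-to-zero floor are assumptions, since the quoted text does not state them.

```python
import math

def learning_rate(step: int,
                  base_lr: float = 2.5e-4,    # 0.00025, as reported
                  warmup_steps: int = 100,    # 100-step ramp-up
                  total_steps: int = 32_000): # assumed placeholder, not from the paper
    """Linear ramp-up for the first warmup_steps, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```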