reproducibilityindex.ai

Proof Artifact Co-Training for Theorem Proving with Language Models

Authors: Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, Stanislas Polu

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We train large language models on these data and demonstrate that PACT signiﬁcantly improves theorem proving success rate on a held-out suite of test theorems, from 32% to 48%.
Researcher Affiliation	Collaboration	Jesse Michael Han University of Pittsburgh Open AI Jason Rute IBM Research Yuhuai Wu Google Research Stanford University Edward W. Ayers Carnegie Mellon University Stanislas Polu Open AI
Pseudocode	No	The paper describes the 'best-ﬁrst search algorithm' in Section 3.3 but presents it in paragraph form rather than as a structured pseudocode or algorithm block.
Open Source Code	Yes	The source code used to generate the Lean datasets and run the evaluation is open source and made available in the following repositories: Lean theorem proving environment : https://github.com/jesse-michael-han/lean-tpe-public Tactic step data pipeline : https://github.com/jasonrute/lean_proof_recording PACT data pipeline : https://github.com/jesse-michael-han/lean-step-public
Open Datasets	No	LEANSTEP is the ﬁrst and only tactic proof dataset for the Lean theorem prover. This makes available a large set of formal mathematical data to researchers covering a diverse and deep spectrum of pure mathematics. (The paper explains how the dataset is generated and provides code for generating it, but not a direct link to the processed dataset itself.)
Dataset Splits	Yes	We use an 80-5-15 train-validation-test split. We split all datapoints deterministically by theorem name, by hashing each name to a ﬂoat in (0, 1).
Hardware Specification	Yes	Fine-tuning on our largest dataset mix1 + mix2 + tactic required 26 hours using 64 A100 GPUs exhibiting high FP16 usage, totalling an estimated 1.5K A100(FP16)-hours. Each evaluation required 10 hours with 30% GPU utilization. We observed that our evaluation was bottlenecked by inference and in practice, we hosted up to three evaluation loops at once on a VM with 80 logical cores without achieving full CPU utilization.
Software Dependencies	No	The provided code also supports querying a locally hosted Transformer from the open-source library fairseq via the Fairseq CLI (Ott et al., 2019). (The paper mentions 'fairseq' but does not provide a specific version number for it or any other key software dependency.)
Experiment Setup	Yes	Unless mentioned otherwise, all of our models have 24 layers with dmodel = 1536 and 24 heads, accruing to 837M trainable parameters... We use the standard BPE encoding (Brown et al., 2020), a batch size of 512 and a learning rate of 0.00025 with a cosine schedule and a 100-step ramp-up.