Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Proof Artifact Co-Training for Theorem Proving with Language Models
Authors: Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, Stanislas Polu
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train large language models on these data and demonstrate that PACT significantly improves theorem proving success rate on a held-out suite of test theorems, from 32% to 48%. |
| Researcher Affiliation | Collaboration | Jesse Michael Han University of Pittsburgh Open AI Jason Rute IBM Research Yuhuai Wu Google Research Stanford University Edward W. Ayers Carnegie Mellon University Stanislas Polu Open AI |
| Pseudocode | No | The paper describes the 'best-first search algorithm' in Section 3.3 but presents it in paragraph form rather than as a structured pseudocode or algorithm block. |
| Open Source Code | Yes | The source code used to generate the Lean datasets and run the evaluation is open source and made available in the following repositories: Lean theorem proving environment : https://github.com/jesse-michael-han/lean-tpe-public Tactic step data pipeline : https://github.com/jasonrute/lean_proof_recording PACT data pipeline : https://github.com/jesse-michael-han/lean-step-public |
| Open Datasets | No | LEANSTEP is the first and only tactic proof dataset for the Lean theorem prover. This makes available a large set of formal mathematical data to researchers covering a diverse and deep spectrum of pure mathematics. (The paper explains how the dataset is generated and provides code for generating it, but not a direct link to the processed dataset itself.) |
| Dataset Splits | Yes | We use an 80-5-15 train-validation-test split. We split all datapoints deterministically by theorem name, by hashing each name to a float in (0, 1). |
| Hardware Specification | Yes | Fine-tuning on our largest dataset mix1 + mix2 + tactic required 26 hours using 64 A100 GPUs exhibiting high FP16 usage, totalling an estimated 1.5K A100(FP16)-hours. Each evaluation required 10 hours with 30% GPU utilization. We observed that our evaluation was bottlenecked by inference and in practice, we hosted up to three evaluation loops at once on a VM with 80 logical cores without achieving full CPU utilization. |
| Software Dependencies | No | The provided code also supports querying a locally hosted Transformer from the open-source library fairseq via the Fairseq CLI (Ott et al., 2019). (The paper mentions 'fairseq' but does not provide a specific version number for it or any other key software dependency.) |
| Experiment Setup | Yes | Unless mentioned otherwise, all of our models have 24 layers with dmodel = 1536 and 24 heads, accruing to 837M trainable parameters... We use the standard BPE encoding (Brown et al., 2020), a batch size of 512 and a learning rate of 0.00025 with a cosine schedule and a 100-step ramp-up. |