Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

Authors: Kefan Dong, Tengyu Ma

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate STP with both Lean and Isabelle formal verifiers. With 51.3 billion tokens generated during training in Lean, STP proves 28.5% of the statements in the Lean Workbook dataset, doubling the previous best result of 13.1% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%), ProofNet-test (23.9%), and PutnamBench (8/644) with pass@3200.
Researcher Affiliation Academia Stanford University. Correspondence to: Kefan Dong <EMAIL>, Tengyu Ma <EMAIL>.
Pseudocode Yes Our self-play training stage of STP is shown in Fig. 1. The main difference compared to expert iteration is the conjecturer in Steps 1 and 4, highlighted in a yellow background. ... (See the pseudocode and details in Appendix A.2.) ... Pseudo-code of this step is in Appendix A.3, and an efficient implementation is in Appendix A.5.
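The iterative conjecturing-and-proving loop that the quoted pseudocode refers to can be sketched at a high level as follows. This is a minimal illustration, not the paper's released implementation: the function structure and all names (`propose`, `prove`, `verify`, `train`) are hypothetical placeholders.

```python
# Hypothetical sketch of one STP self-play round (not the released code):
# a conjecturer proposes new statements, the prover samples k candidate
# proofs per statement, a formal verifier filters them, and the verified
# (statement, proof) pairs become training data for the next round.

def stp_iteration(seed_statements, propose, prove, verify, train, k=32):
    # Step 1: the conjecturer generates one new conjecture per seed statement.
    conjectures = [propose(s) for s in seed_statements]
    targets = seed_statements + conjectures

    # Steps 2-3: sample k proofs per statement; keep only verified ones.
    training_pairs = []
    for stmt in targets:
        candidates = {prove(stmt) for _ in range(k)}  # dedupe identical proofs
        training_pairs += [(stmt, p) for p in candidates if verify(stmt, p)]

    # Step 4: retrain the model on the filtered (statement, proof) pairs.
    train(training_pairs)
    return training_pairs
```

The key difference from plain expert iteration is the conjecturing step, which grows the pool of training statements each round rather than repeatedly sampling proofs for a fixed set.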
Open Source Code Yes To facilitate future research, we release our code, model, and dataset in this url: https://github.com/kfdong/STP.
Open Datasets Yes Our primary source of statements without proofs is the de-duplicated Lean Workbook (Ying et al., 2024), which contains around 89K Lean4 statements (see Appendix A.4 for details). For the Isabelle experiments, we translate the Lean4 statements to Isabelle using DeepSeek-V2.5 with few-shot prompting. For the Lean experiments, we combine Lean Workbook, miniF2F-valid, and ProofNet-valid as the training dataset for STP.
Dataset Splits Yes Our primary source of statements without proofs is the de-duplicated Lean Workbook (Ying et al., 2024), which contains around 89K Lean4 statements (see Appendix A.4 for details). For the Isabelle experiments, we translate the Lean4 statements to Isabelle using DeepSeek-V2.5 with few-shot prompting. For the Lean experiments, we combine Lean Workbook, miniF2F-valid, and ProofNet-valid as the training dataset for STP. ... In each iteration of STP, we sample K = 32 proofs per conjecture/statement.
Hardware Specification Yes Our experiments are primarily done on TPU-v4 VMs with 32 nodes. Each node contains 4 TPU chips (8 TPU cores), 240 CPU cores, and 400GB of memory.
Software Dependencies No We use vLLM (Kwon et al., 2023) to generate LLM outputs, and Levanter to train the LLM. In both STP and expert iteration, since the generated proofs are heavily filtered (based on correctness, elegance, trivialness, etc.) when constructing the training dataset, LLM training takes less than 25% of the wall-clock time for TPU compute, and generating proofs takes the remaining 75%. ... We use PISA (Jiang et al., 2021) to interact with Isabelle, and enforce a 10s timeout for any proof step and a 360s timeout for entire proofs.
Experiment Setup Yes For inference, we cap the number of generated tokens to 1024, and set the sampling temperature to 0.7 for Llemma-7b and 1.0 for DeepSeek-Prover, following Dong et al. (2024); Xin et al. (2024b), respectively. For training, we use batch size 2048 and Adam (Kingma & Ba, 2014) with a constant learning rate of 5e-5 in STP, and 1e-4 in SFT and final re-training. The discount factors are γ = exp(−0.001) and β = exp(−0.01). In each iteration of STP, we sample K = 32 proofs per conjecture/statement. For expert iteration and parallel sampling, we use K = 64.
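The reported hyperparameters can be collected into a small configuration sketch. Only the numeric values come from the excerpt above; the dictionary layout and key names are my own illustration, not the paper's config format.

```python
import math

# Hyperparameters quoted in the experiment setup; structure is illustrative.
INFERENCE = {
    "max_new_tokens": 1024,
    "temperature": {"Llemma-7b": 0.7, "DeepSeek-Prover": 1.0},
}

TRAINING = {
    "batch_size": 2048,
    "optimizer": "Adam",
    "lr_self_play": 5e-5,       # constant LR during STP self-play training
    "lr_sft": 1e-4,             # SFT and final re-training
    "gamma": math.exp(-0.001),  # discount factor γ ≈ 0.9990
    "beta": math.exp(-0.01),    # discount factor β ≈ 0.9900
    "proofs_per_statement": {"STP": 32, "expert_iteration": 64},
}
```

Note that both discount factors are written as exponentials of small negative numbers, so they sit just below 1, as a discount factor must.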