Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

3D-Prover: Diversity Driven Theorem Proving With Determinantal Point Processes

Authors: Sean Lamont, Christian Walder, Amir Dezfouli, Paul Montague, Michael Norrish

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of 3D-Prover on the mini F2F and Lean Dojo benchmarks by augmenting popular open source proving LLMs. We show that our approach leads to an increase in the overall proof rate, as well as a significant improvement in the tactic success rate, execution time and diversity. We run our experiments in Lean (De Moura et al., 2015) using the BAIT (Lamont et al., 2024) platform with a modified Lean Dojo (Yang et al., 2023) environment, where we set an environment timeout of 600 seconds per proof attempt.
Researcher Affiliation Collaboration Sean Lamont1,2, Christian Walder3, Amir Dezfouli4, Paul Montague2, Michael Norrish1 1Australian National University 2Defence Science and Technology Group 3Google Deep Mind 4BIMLOGIQ EMAIL
Pseudocode Yes Algorithm 1: 3D-Prover Input : Goal g, candidate tactics T = {ti}N i=1, filter size K, Encoder E, Predictor P, error weight λs, time weight λτ, temperature θ, tactic policy π0 Output :Filtered tactics T T // Compute embeddings and scores for each tactic for i in {1, . . . , N} do ϕi E(g, ti) (si, τi) P(ϕi) τi 1 τi ||τ||, τ = (τ1, .., τN) mi exp(π0(ti|g)/θ) PN j=1 exp(π0(tj|g)/θ) qi mi + λssi + λττi // Compute kernel matrix L BT B, where B = [q1ϕ1, . . . , q NϕN] Compute eigenvalues λi and eigenvectors vi of L Sample J {1, . . . , N} using Algorithm 2 of Kulesza & Taskar (2011), with {(vi, λi)}, k = K return T = {tj}j J
Open Source Code Yes We make our code available at https://github.com/sean-lamont/3D-Prover.
Open Datasets Yes We demonstrate the effectiveness of 3D-Prover on the mini F2F and Lean Dojo benchmarks by augmenting popular open source proving LLMs.
Dataset Splits Yes We obtain D from a single Re Prover attempt on mini F2F-valid, yielding 498,236 transitions split randomly into 95% training, 5% testing.
Hardware Specification Yes For traning our transition models, we used a singe RTX4090 GPU with an Intel i9 13900k processor. For evaluating, we used two internal machines each with two RTXA6000 GPUs, and a Intel Xeon W-2223.
Software Dependencies No The paper mentions 'Lean (De Moura et al., 2015)', 'Re Prover (Yang et al., 2023)' and 'Intern LM2.5-Step-Prover (Wu et al., 2024)' but does not specify software versions for programming languages, libraries, or frameworks used in their implementation.
Experiment Setup Yes We implement the Predictor P as a single hidden layer MLP, with hidden dimension d/2 (where d = 1472) and two real valued output nodes. The time prediction ˆτ is the output of the first node, and the status prediction ˆs is taken as the sigmoid of the second. [...] We use the Adam W optimizer, with a learning rate of 10 5 and a batch size of 1, train for 2 epochs, and report the results on the test set. We set an environment timeout of 600 seconds per proof attempt.