Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Lean-STaR: Learning to Interleave Thinking and Proving

Authors: Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Lean-STa R significantly outperforming base models (43.4% 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness. ... Our main results are reported in Table 1.
Researcher Affiliation	Academia	1Language Technologies Institute, Carnegie Mellon University 2Institute for Interdisciplinary Information Sciences, Tsinghua University EMAIL
Pseudocode	No	The paper describes methods in prose and with diagrams (Figure 3 is a pipeline diagram) but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code	No	The paper mentions "https://leanstar.github.io/" but does not contain an explicit statement confirming the release of the code for the described methodology or a direct link to a code repository. The provided URL points to a project homepage rather than a specific code repository.
Open Datasets	Yes	We instantiate Lean-STa R using the best available open language model pre-trained on the Lean corpus (Intern LM2-Math-base-7b (Ying et al., 2024)), and follow standard practice in using Lean s Mathlib as the underlying training set (via the Lean Dojo dataset (Yang et al., 2023)). We generate an initial set of thoughts for Mathlib using GPT-4, perform two rounds of expert iteration, then evaluate the model on mini F2F (Zheng et al., 2021) and leandojo (Yang et al., 2023), the de-facto standard benchmark for evaluating language-model based theorem provers.
Dataset Splits	Yes	We evaluate our method on the Mini F2F benchmark (Zheng et al., 2021). We use a similar evaluation setting as previous works (Yang et al., 2023; Welleck & Saha, 2023; Ying et al., 2024), but use our sampling method instead of best-first search for the evaluation of our thought-augmented theorem proving model as discussed in ( 3.3). ... We use a random subset of Leandojo4-v9-test (novel premises) with a size of 320 as test set of leandojo.
Hardware Specification	Yes	We collected 32, 231 different (proof state, thoughts, next-tactic) pairs in successful proof trajectories during expert iteration, which takes about 4 days with 8 A100 GPUs.
Software Dependencies	No	The paper mentions specific language models and theorem provers (e.g., Lean, GPT-4-0125, Intern LM2-Math-base-7b) but does not provide a comprehensive list of other software components like programming languages, libraries, or operating systems with their specific version numbers required for replication.
Experiment Setup	Yes	We fine-tune for 1 epoch to obtain the SFT model. For the learning rate, we use a warmup in the first 20% steps from 0 to 2 10 5, followed by a cosine schedule decaying to zero. ... In each iteration we use sampling on the Lean Dojo Benchmark 4 dataset, and save the (state, thought, tactic) examples that are part of successful proofs. For each problem, we sample K = 32 times in parallel with temperature T = 1.0, and limit the number of times a tactic can be generated to a total of N = 5 per problem. Also, sampling is limited to 1 minute per problem.