Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

Authors: Azim Ospanov, Farzan Farnia, Roozbeh Mohit

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform a thorough analysis of the formal and informal statements in the mini F2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in mini F2F. Our evaluation results reveal that the best accuracy of such pipeline can be about 36% using the So TA models in the literature, considerably lower than the individual So TA accuracies, 97% and 69% reported in the autoformalization and theorem proving literature. We perform a thorough evaluation of autoformalization models on mini F2F-v1, v2s, and v2c, demonstrating the current shortcomings in the evaluation practices in the autoformalization literature as well as the benefits of mini F2F-v2s and v2c.
Researcher Affiliation	Collaboration	Azim Ospanov EMAIL Farzan Farnia EMAIL Roozbeh Yousefzadeh EMAIL Huawei Hong Kong Research Center Department of Computer Science & Engineering, The Chinese University of Hong Kong
Pseudocode	No	The paper describes methods for evaluating AI systems for formal reasoning and analyzes a benchmark dataset, but it does not present any structured pseudocode or algorithm blocks for its own methodology.
Open Source Code	No	Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2. We plan to release the proposed benchmark along with the proofs. Public Git Hub and Hugging Face links will be provided at a later date.
Open Datasets	Yes	Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.
Dataset Splits	Yes	The dataset follows the same split as the original version, i.e. 244 test instances and 244 validation instances.
Hardware Specification	Yes	All experiments were performed on eight NVIDIA A5000 GPUs with 128 CPU cores.
Software Dependencies	Yes	Deepseek Prover-V1.5-RL [21], Goedel-Prover-SFT [22] and Deepseek-Prover-V2-7B [23] are evaluated under Lean v4.9.0, matching the versions used in their original papers, whereas Kimina-Prover-Distill7B [12] is tested with the newer Lean v4.17.0. Lean compiler of choice is Lean REPL [16].
Experiment Setup	Yes	All experiments use a sampling budget of @1 or @128 for the dedicated autoformalizer models and @10 (with intermediate compiler feedback) for o4-mini. We note that although o4-mini has a different sample budget, we stop at the first successful compilation attempt, which puts it on par with @1 Herald translator and Kimina autoformalizer models. Following the literature, we use a sampling budget of @32 in all runs.