Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Autoformalization with Large Language Models
Authors: Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, Christian Szegedy
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our methodology results in a new state-of-the-art result on the Mini F2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%. |
| Researcher Affiliation | Collaboration | 1Google Research 2Stanford University 3University of Cambridge |
| Pseudocode | No | The paper describes the expert iteration procedure in textual form in Section 5.1, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We include the code we used to train the models in the supplemental material. |
| Open Datasets | Yes | MATH [24] contains in total 12,500 (7,500 training and 5,000 test) middle school and high school mathematical competition problems. ... Mini F2F [52] is a recently introduced benchmark containing 488 mathematical competition statements manually formalized by humans in three different formal languages. |
| Dataset Splits | No | The paper mentions evaluating on 'the validation and test fractions of mini F2F' in Section 5.2, and that 'The model s evaluation loss reaches a minimum after 13,000 steps and we use that checkpoint,' implying a validation set was used. However, it does not specify the explicit size, percentage, or specific criteria for this validation split to allow for reproduction. |
| Hardware Specification | Yes | For pre-training, ο¬ne-tuning, and evaluation, we use a TPUv3 with 8 cores from Google Cloud Platform. The Isabelle process has access to up to 32 CPU cores. |
| Software Dependencies | No | The paper mentions using 'Wang [45] s implementation... of a GPT-2 [38] style decoder-only transformer [44] model,' which points to 'Mesh-Transformer-JAX' in the references, implying JAX as a dependency. However, it does not provide specific version numbers for JAX or other critical software libraries (e.g., PyTorch, TensorFlow) used in the experiments. |
| Experiment Setup | Yes | We pre-train the model on the Git Hub + ar Xiv subsets of The Pile [14] for 200,000 steps, with a context length of 2048 tokens. ... We use a warmup strategy [18], raising the learning rate linearly from 0 to 2 Γ 10^β4 in 3,000 steps. We then use a cosine learning rate scheduler [34] for the rest of the pre-training, with a final learning rate of 1.2 Γ 10^β5. We use a global batch size of 32 sequences, or 65,536 tokens. For ο¬ne-tuning we use the same learning rate schedule, with 10,000 warmup steps, 90,000 annealing steps, maximum learning rate 3 Γ 10^β4 and final learning rate 3 Γ 10^β5. The global batch size is 144 sequences, or 294,912 tokens. |