Autoformalization with Large Language Models

Authors: Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, Christian Szegedy

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our methodology results in a new state-of-the-art result on the miniF2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%.
Researcher Affiliation | Collaboration | Google Research, Stanford University, University of Cambridge
Pseudocode | No | The paper describes the expert iteration procedure in textual form in Section 5.1, but it does not include any explicitly labeled pseudocode or algorithm blocks (a generic sketch of such a loop is given below the table).
Open Source Code | Yes | We include the code we used to train the models in the supplemental material.
Open Datasets | Yes | MATH [24] contains in total 12,500 (7,500 training and 5,000 test) middle school and high school mathematical competition problems. ... miniF2F [52] is a recently introduced benchmark containing 488 mathematical competition statements manually formalized by humans in three different formal languages.
Dataset Splits | No | The paper mentions evaluating on 'the validation and test fractions of miniF2F' in Section 5.2, and states that 'The model's evaluation loss reaches a minimum after 13,000 steps and we use that checkpoint,' implying a validation set was used. However, it does not specify the size, percentage, or selection criteria of this validation split, which would be needed to reproduce it.
Hardware Specification | Yes | For pre-training, fine-tuning, and evaluation, we use a TPUv3 with 8 cores from Google Cloud Platform. The Isabelle process has access to up to 32 CPU cores.
Software Dependencies | No | The paper mentions using 'Wang [45]'s implementation ... of a GPT-2 [38] style decoder-only transformer [44] model,' which points to Mesh-Transformer-JAX in the references, implying JAX as a dependency. However, it does not provide version numbers for JAX or for the other software used in the experiments.
Experiment Setup | Yes | We pre-train the model on the GitHub + arXiv subsets of The Pile [14] for 200,000 steps, with a context length of 2048 tokens. ... We use a warmup strategy [18], raising the learning rate linearly from 0 to 2 × 10^−4 in 3,000 steps. We then use a cosine learning rate scheduler [34] for the rest of the pre-training, with a final learning rate of 1.2 × 10^−5. We use a global batch size of 32 sequences, or 65,536 tokens. For fine-tuning we use the same learning rate schedule, with 10,000 warmup steps, 90,000 annealing steps, maximum learning rate 3 × 10^−4 and final learning rate 3 × 10^−5. The global batch size is 144 sequences, or 294,912 tokens.
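
Since the paper, per the Pseudocode row above, describes its expert iteration procedure only in prose (Section 5.1), the following is a generic, hedged sketch of what such a loop typically looks like: sample candidate proofs for statements, keep only the proofs a formal checker (e.g., an Isabelle process) accepts, fine-tune the prover on the successes, and repeat. The names (prover, verify, sample_proof, finetune) are hypothetical stand-ins; this is not the authors' code.

```python
# Generic expert-iteration loop (illustrative sketch only; not the authors' implementation).
# `prover` stands in for a neural theorem prover, `verify` for a formal proof checker
# such as an Isabelle process.

def expert_iteration(prover, statements, verify, rounds=2):
    successful = []                           # accumulated (statement, proof) pairs
    for _ in range(rounds):
        for stmt in statements:               # e.g., autoformalized competition problems
            proof = prover.sample_proof(stmt)
            if verify(stmt, proof):           # keep only machine-checked proofs
                successful.append((stmt, proof))
        prover = prover.finetune(successful)  # retrain on all proofs found so far
    return prover
```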
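
The warmup-plus-cosine schedules quoted in the Experiment Setup row can be written down concretely with optax, the optimizer library commonly paired with the JAX/Mesh-Transformer-JAX stack the paper builds on. The snippet below is a minimal sketch under that assumption: the step counts and learning rates come from the quote above, while the choice of optax and its warmup_cosine_decay_schedule helper is ours, not something the paper specifies.

```python
# Minimal sketch of the reported learning-rate schedules using optax
# (library choice is an assumption; the paper does not name its optimizer code).
import optax

# Pre-training: linear warmup 0 -> 2e-4 over 3,000 steps, then cosine decay
# to 1.2e-5 over the remainder of the 200,000 training steps.
pretrain_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=2e-4,
    warmup_steps=3_000,
    decay_steps=200_000,   # total schedule length; warmup is included in this count
    end_value=1.2e-5,
)

# Fine-tuning: 10,000 warmup steps + 90,000 annealing steps,
# peak learning rate 3e-4, final learning rate 3e-5.
finetune_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,
    warmup_steps=10_000,
    decay_steps=100_000,
    end_value=3e-5,
)

# The quoted batch sizes are consistent with the 2048-token context length:
#  32 sequences * 2048 tokens =  65,536 tokens per pre-training batch
# 144 sequences * 2048 tokens = 294,912 tokens per fine-tuning batch
```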