Autoformalization with Large Language Models
Authors: Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, Christian Szegedy
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our methodology results in a new state-of-the-art result on the miniF2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%. |
| Researcher Affiliation | Collaboration | ¹Google Research, ²Stanford University, ³University of Cambridge |
| Pseudocode | No | The paper describes the expert iteration procedure in textual form in Section 5.1, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We include the code we used to train the models in the supplemental material. |
| Open Datasets | Yes | MATH [24] contains in total 12,500 (7,500 training and 5,000 test) middle school and high school mathematical competition problems. ... miniF2F [52] is a recently introduced benchmark containing 488 mathematical competition statements manually formalized by humans in three different formal languages. |
| Dataset Splits | No | The paper mentions evaluating on 'the validation and test fractions of miniF2F' in Section 5.2, and that 'The model's evaluation loss reaches a minimum after 13,000 steps and we use that checkpoint,' implying a validation set was used. However, it does not specify the explicit size, percentage, or selection criteria for this validation split that would allow it to be reproduced. |
| Hardware Specification | Yes | For pre-training, fine-tuning, and evaluation, we use a TPUv3 with 8 cores from Google Cloud Platform. The Isabelle process has access to up to 32 CPU cores. |
| Software Dependencies | No | The paper mentions using 'Wang [45]'s implementation... of a GPT-2 [38] style decoder-only transformer [44] model,' which points to 'Mesh-Transformer-JAX' in the references, implying JAX as a dependency. However, it does not provide specific version numbers for JAX or other critical software libraries (e.g., PyTorch, TensorFlow) used in the experiments. |
| Experiment Setup | Yes | We pre-train the model on the GitHub + arXiv subsets of The Pile [14] for 200,000 steps, with a context length of 2048 tokens. ... We use a warmup strategy [18], raising the learning rate linearly from 0 to 2 × 10^−4 in 3,000 steps. We then use a cosine learning rate scheduler [34] for the rest of the pre-training, with a final learning rate of 1.2 × 10^−5. We use a global batch size of 32 sequences, or 65,536 tokens. For fine-tuning we use the same learning rate schedule, with 10,000 warmup steps, 90,000 annealing steps, maximum learning rate 3 × 10^−4 and final learning rate 3 × 10^−5. The global batch size is 144 sequences, or 294,912 tokens. (A sketch of this warmup + cosine schedule follows the table.) |
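
The learning-rate schedule quoted in the Experiment Setup row (linear warmup followed by cosine annealing to a final value) can be written out as a minimal Python sketch. The function name `lr_schedule` and its signature are illustrative rather than taken from the authors' released code, and the 197,000 pre-training decay steps are inferred from the stated 200,000 total steps minus 3,000 warmup steps.

```python
import math

def lr_schedule(step, warmup_steps, decay_steps, peak_lr, final_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay to final_lr.

    Illustrative sketch of the schedule described in the paper's
    Experiment Setup; not the authors' implementation.
    """
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * step / warmup_steps
    # Cosine annealing from peak_lr down to final_lr over decay_steps.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Pre-training: 3,000 warmup steps, peak 2e-4, cosine decay to 1.2e-5 over
# the remaining 197,000 of the 200,000 total steps (assumed split).
pretrain_lr = lambda step: lr_schedule(step, 3_000, 197_000, 2e-4, 1.2e-5)

# Fine-tuning: 10,000 warmup steps, 90,000 annealing steps, peak 3e-4, final 3e-5.
finetune_lr = lambda step: lr_schedule(step, 10_000, 90_000, 3e-4, 3e-5)
```

At step 0 the rate is 0, at the end of warmup it equals the peak value, and after the annealing window it stays at the stated final learning rate, matching the numbers quoted from the paper.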