reproducibilityindex.ai

Autoformalization with Large Language Models

Authors: Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, Christian Szegedy

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our methodology results in a new state-of-the-art result on the Mini F2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%.
Researcher Affiliation	Collaboration	1Google Research 2Stanford University 3University of Cambridge
Pseudocode	No	The paper describes the expert iteration procedure in textual form in Section 5.1, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We include the code we used to train the models in the supplemental material.
Open Datasets	Yes	MATH [24] contains in total 12,500 (7,500 training and 5,000 test) middle school and high school mathematical competition problems. ... Mini F2F [52] is a recently introduced benchmark containing 488 mathematical competition statements manually formalized by humans in three different formal languages.
Dataset Splits	No	The paper mentions evaluating on 'the validation and test fractions of mini F2F' in Section 5.2, and that 'The model s evaluation loss reaches a minimum after 13,000 steps and we use that checkpoint,' implying a validation set was used. However, it does not specify the explicit size, percentage, or specific criteria for this validation split to allow for reproduction.
Hardware Specification	Yes	For pre-training, ﬁne-tuning, and evaluation, we use a TPUv3 with 8 cores from Google Cloud Platform. The Isabelle process has access to up to 32 CPU cores.
Software Dependencies	No	The paper mentions using 'Wang [45] s implementation... of a GPT-2 [38] style decoder-only transformer [44] model,' which points to 'Mesh-Transformer-JAX' in the references, implying JAX as a dependency. However, it does not provide specific version numbers for JAX or other critical software libraries (e.g., PyTorch, TensorFlow) used in the experiments.
Experiment Setup	Yes	We pre-train the model on the Git Hub + ar Xiv subsets of The Pile [14] for 200,000 steps, with a context length of 2048 tokens. ... We use a warmup strategy [18], raising the learning rate linearly from 0 to 2 × 10^−4 in 3,000 steps. We then use a cosine learning rate scheduler [34] for the rest of the pre-training, with a final learning rate of 1.2 × 10^−5. We use a global batch size of 32 sequences, or 65,536 tokens. For ﬁne-tuning we use the same learning rate schedule, with 10,000 warmup steps, 90,000 annealing steps, maximum learning rate 3 × 10^−4 and final learning rate 3 × 10^−5. The global batch size is 144 sequences, or 294,912 tokens.