Multi-language Diversity Benefits Autoformalization

Authors: Albert Q. Jiang, Wenda Li, Mateja Jamnik

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that language models fine-tuned on MMA can produce up to 29-31% of statements acceptable with minimal corrections on the miniF2F and ProofNet benchmarks, up from 0% with the base model.
Researcher Affiliation | Academia | Albert Q. Jiang (University of Cambridge, qj213@cam.ac.uk); Wenda Li (University of Edinburgh, wenda.li@ed.ac.uk); Mateja Jamnik (University of Cambridge, mateja.jamnik@cl.cam.ac.uk)
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release the fine-tuned models for inference. We also release the MMA dataset for people to train their autoformalization models on, and to enrich MMA with more domains and languages. The MMA dataset and the fine-tuned models are available from the official repository: MMA.
Open Datasets | Yes | We created a parallel dataset of 332K informal-formal pairs, which we refer to as the MMA (Multi-language Mathematical Autoformalization) dataset. The MMA dataset and the fine-tuned models are available from the official repository: MMA. We release MMA for public exploration.
Dataset Splits | No | In Figure 1, we plot the loss and the token accuracy with teacher-forcing [Goyal et al., 2016] for the LLaMA model, on the Isabelle and the Lean4 validation sets for all 3 models. The paper mentions using validation sets but does not explicitly provide details about the training/test/validation dataset splits (e.g., percentages, sample counts, or splitting methodology) for the MMA dataset. (A sketch of the token-accuracy metric follows the table.)
Hardware Specification | Yes | We use the EasyLM [Geng, 2023] software framework on a TPUv4-64, with 32 megacores.
Software Dependencies | No | The paper mentions using the 'EasyLM [Geng, 2023] software framework' and the 'AdamW optimiser [Loshchilov and Hutter, 2019]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For fine-tuning, we use the cross-entropy loss with the loss on the input masked out. We parallelise the model across 16 devices, and use a local batch size of 8 sequences, with each sequence having a maximum of 512 tokens. We use the AdamW optimiser [Loshchilov and Hutter, 2019], perform 5000 linear warmup steps with a peak learning rate of 3 × 10^-5, and then decay the learning rate with a cosine schedule for 35000 steps to 3 × 10^-6. (A sketch of this schedule and the masked loss follows the table.)
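
To make the quoted fine-tuning recipe concrete, below is a minimal sketch, written in plain Python/NumPy rather than the authors' EasyLM/JAX code, of the learning-rate schedule and the input-masked cross-entropy loss described in the Experiment Setup row. Only the numeric hyperparameters come from the paper; every function and variable name is an illustrative assumption.

```python
# A minimal sketch, NOT the authors' EasyLM/JAX training code: it only illustrates
# the hyperparameters quoted above (5000 linear warmup steps to a peak learning
# rate of 3e-5, cosine decay over 35000 steps to 3e-6, and cross-entropy with the
# loss on the input masked out). All names are illustrative assumptions.
import math
import numpy as np

PEAK_LR = 3e-5        # peak learning rate reached after warmup (from the paper)
FINAL_LR = 3e-6       # learning rate at the end of the cosine decay (from the paper)
WARMUP_STEPS = 5_000  # linear warmup steps (from the paper)
DECAY_STEPS = 35_000  # cosine-decay steps after warmup (from the paper)


def learning_rate(step: int) -> float:
    """Linear warmup (assumed to start from 0), then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(step - WARMUP_STEPS, DECAY_STEPS) / DECAY_STEPS  # in [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))             # 1 -> 0
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def masked_cross_entropy(logits: np.ndarray, targets: np.ndarray,
                         loss_mask: np.ndarray) -> float:
    """Cross-entropy over (seq_len, vocab_size) logits, where positions with
    loss_mask == 0 (the informal-statement input) are excluded from the loss."""
    logits = logits - logits.max(axis=-1, keepdims=True)                  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(targets.shape[0]), targets]                # per-token NLL
    return float((nll * loss_mask).sum() / max(loss_mask.sum(), 1.0))


# The schedule peaks at step 5,000 and reaches 3e-6 at step 40,000.
assert abs(learning_rate(5_000) - 3e-5) < 1e-12
assert abs(learning_rate(40_000) - 3e-6) < 1e-12
```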
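
The Dataset Splits row quotes Figure 1's validation metrics, which include token accuracy under teacher forcing. The paper does not give an implementation of this metric, so the sketch below shows one common way to compute it: feed the reference sequence to the model, take the argmax prediction at each position, and average correctness over the positions that contribute to the loss. All names are assumptions.

```python
# Illustrative only: a common way to compute teacher-forcing token accuracy,
# i.e. the fraction of target positions where the model's argmax prediction
# matches the reference token when the reference prefix is always fed as input.
# The paper does not specify its exact implementation; names here are assumptions.
import numpy as np


def teacher_forcing_token_accuracy(logits: np.ndarray, targets: np.ndarray,
                                   loss_mask: np.ndarray) -> float:
    """logits: (seq_len, vocab_size); targets, loss_mask: (seq_len,).
    Positions with loss_mask == 0 (e.g. the informal input) are ignored."""
    predictions = logits.argmax(axis=-1)                   # greedy token at each step
    correct = (predictions == targets).astype(np.float64) * loss_mask
    return float(correct.sum() / max(loss_mask.sum(), 1.0))
```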