Multi-language Diversity Benefits Autoformalization
Authors: Albert Q. Jiang, Wenda Li, Mateja Jamnik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that language models fine-tuned on MMA can produce up to 29–31% of statements acceptable with minimal corrections on the miniF2F and ProofNet benchmarks, up from 0% with the base model. |
| Researcher Affiliation | Academia | Albert Q. Jiang, University of Cambridge, qj213@cam.ac.uk; Wenda Li, University of Edinburgh, wenda.li@ed.ac.uk; Mateja Jamnik, University of Cambridge, mateja.jamnik@cl.cam.ac.uk |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the fine-tuned models for inference. We also release the MMA dataset for people to train their autoformalization models on, and to enrich MMA with more domains and languages. The MMA dataset and the fine-tuned models are available from the official repository: MMA. |
| Open Datasets | Yes | We created a parallel dataset of 332K informal-formal pairs, which we refer to as the MMA (Multi-language Mathematical Autoformalization) dataset. The MMA dataset and the fine-tuned models are available from the official repository: MMA. We release MMA for public exploration. |
| Dataset Splits | No | In Figure 1, we plot the loss and the token accuracy with teacher-forcing [Goyal et al., 2016] for the LLaMA model, on the Isabelle and the Lean4 validation sets for all 3 models. The paper mentions using validation sets but does not explicitly provide details about the training/test/validation dataset splits (e.g., percentages, sample counts, or splitting methodology) for the MMA dataset. |
| Hardware Specification | Yes | We use the EasyLM [Geng, 2023] software framework on a TPUv4-64, with 32 megacores. |
| Software Dependencies | No | The paper mentions using the 'EasyLM [Geng, 2023] software framework' and the 'AdamW optimiser [Loshchilov and Hutter, 2019]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For fine-tuning, we use the cross-entropy loss with the loss on the input masked out. We parallelise the model across 16 devices, and use a local batch size of 8 sequences, with each sequence having a maximum of 512 tokens. We use the AdamW optimiser [Loshchilov and Hutter, 2019], perform 5000 linear warmup steps with a peak learning rate of 3 × 10⁻⁵, and then decay the learning rate with a cosine schedule for 35000 steps to 3 × 10⁻⁶. A sketch of this optimiser schedule appears after the table. |
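
The experiment setup quoted above can be reconstructed as a short sketch. The snippet below uses optax (EasyLM is JAX-based), with the warmup/decay step counts and learning rates taken from the paper; the schedule construction and the `masked_cross_entropy` helper are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the fine-tuning setup described in the paper, assuming an
# optax-based training loop. Only the numeric hyperparameters come from the
# paper; every name below is illustrative.
import jax.numpy as jnp
import optax

# Linear warmup for 5,000 steps to a peak LR of 3e-5, then cosine decay over
# a further 35,000 steps down to 3e-6 (optax's decay_steps includes warmup).
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-5,
    warmup_steps=5_000,
    decay_steps=5_000 + 35_000,
    end_value=3e-6,
)
optimizer = optax.adamw(learning_rate=schedule)


def masked_cross_entropy(logits, targets, loss_mask):
    """Cross-entropy with the loss on input (prompt) tokens masked out.

    loss_mask is 1.0 on target (formal-statement) tokens and 0.0 on the
    informal input tokens, matching the paper's description of masking the
    loss on the input.
    """
    per_token = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
    return jnp.sum(per_token * loss_mask) / jnp.maximum(jnp.sum(loss_mask), 1.0)
```

One design note: masking the input tokens means the model is optimised only for producing the formal statement given the informal one, rather than for modelling the informal text itself, which is the standard setup for sequence-to-sequence style fine-tuning of decoder-only models.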