MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data

Authors: Yinya Huang, Xiaohan Lin, Zhengying Liu, Qingxing Cao, Huajian Xin, Haiming Wang, Zhenguo Li, Linqi Song, Xiaodan Liang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models." and "We conduct extensive data analysis and experiments on the generated MUSTARDSAUCE."
Researcher Affiliation | Collaboration | 1 City University of Hong Kong, 2 Shenzhen Campus of Sun Yat-sen University, 3 Huawei Noah's Ark Lab, 4 Dark Matter AI Research, 5 MBZUAI, 6 City USRI
Pseudocode | Yes | "Table 13 demonstrates the prompt template used in the proof-filtering stage in MUSTARD." and "Table 14 demonstrates the variation of prompt templates used in the proof-generation stage in MUSTARD." These tables lay the prompts out as structured, code-like numbered steps; a hedged sketch of such a template follows the table.
Open Source Code | Yes | "Codes and data are available at: https://github.com/Eleanor-H/MUSTARD."
Open Datasets | Yes | "For the task of math word problems, we use GSM8K (Cobbe et al., 2021) and the MATH dataset (Hendrycks et al., 2021) for evaluation. For evaluating automated theorem proving, we use Mathlib and the miniF2F (Zheng et al., 2022) benchmark." and "Mathlib is a community-maintained library designed for the Lean theorem prover. It encompasses both programming tools and mathematical content, along with tactics that leverage these tools to facilitate mathematical development. The version of Mathlib we use is consistent with Wang et al. (2023a)."
Dataset Splits | Yes | "Moreover, we randomly split MUSTARDSAUCE-valid into 4,866 training data, 500 validation data, and 500 test data for benchmarking model performances on the dataset." (See the split sketch after the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU models, CPU types, or memory) used to run the experiments. It mentions models such as Llama 2-7B and Llama 2-70B but not the underlying hardware.
Software Dependencies | No | The paper mentions software components such as LoRA, GPT-2-large, Llama 2, and the Lean Prover, but does not provide specific version numbers for these or other software dependencies (e.g., "PyTorch 1.9" or "Lean Prover version X.Y.Z"). (A version-logging sketch follows the table.)
Experiment Setup | Yes | "The training is conducted with a maximum of 10 epochs using a batch size of 16 and a warm-up step of 1,000, with a maximum learning rate of 1e-4 and a minimum learning rate of 5e-6." (A learning-rate schedule sketch based on these values follows the table.)
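
The prompt templates themselves (Tables 13 and 14 of the paper) are not reproduced in this report. Below is a minimal, hypothetical sketch of how a proof-filtering prompt of that kind could be assembled; the template wording, field names, and function name are illustrative assumptions, not the paper's actual template.

```python
# Hypothetical proof-filtering prompt builder. The template text and field
# names are illustrative only, not the exact wording of Table 13 in the paper.
PROOF_FILTER_TEMPLATE = """You are given a math problem, an informal proof, and a formal Lean proof.

Problem:
{problem}

Informal proof:
{informal_proof}

Formal proof (Lean):
{formal_proof}

Step 1: Check that the formal statement matches the problem.
Step 2: Check that every proof step is justified.
Step 3: Answer "valid" or "invalid" with a short explanation.
"""

def build_proof_filter_prompt(problem: str, informal_proof: str, formal_proof: str) -> str:
    """Fill the illustrative template with one generated data point."""
    return PROOF_FILTER_TEMPLATE.format(
        problem=problem,
        informal_proof=informal_proof,
        formal_proof=formal_proof,
    )
```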
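The reported split of MUSTARDSAUCE-valid (4,866 train / 500 validation / 500 test) can be reproduced with a simple random shuffle. The sketch below assumes the validated data points sit in a JSON-lines file and uses an arbitrary seed; both are assumptions, since the quote only says the data were "randomly split".

```python
import json
import random

def split_mustardsauce_valid(path: str, seed: int = 42):
    """Randomly split the validated data points into 4,866 train / 500 val / 500 test,
    matching the sizes reported in the paper. The file name, JSONL layout, and seed
    are assumptions for illustration."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    assert len(records) == 5866, "expected 4,866 + 500 + 500 validated data points"

    random.Random(seed).shuffle(records)
    train, val, test = records[:4866], records[4866:5366], records[5366:]
    return train, val, test
```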
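Because no software versions are pinned in the paper, a reproduction attempt would typically log its own environment. The sketch below only records whatever versions are installed locally; the package list is an assumption, and this does not recover the versions the authors actually used.

```python
import importlib
import sys

def log_environment(packages=("torch", "transformers", "peft")):
    """Print interpreter and package versions so a rerun can be documented.
    The package list is an assumption; the paper names LoRA, GPT-2-large,
    Llama 2, and the Lean prover but gives no version numbers."""
    print("python", sys.version.split()[0])
    for name in packages:
        try:
            module = importlib.import_module(name)
            print(name, getattr(module, "__version__", "unknown"))
        except ImportError:
            print(name, "not installed")
```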
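The quoted setup (at most 10 epochs, batch size 16, 1,000 warm-up steps, maximum learning rate 1e-4, minimum learning rate 5e-6) can be written as a learning-rate schedule. The decay shape after warm-up is not stated in the quote, so the linear decay below, and the warm-up starting from zero, are assumptions.

```python
def learning_rate(step: int, total_steps: int,
                  warmup_steps: int = 1_000,
                  max_lr: float = 1e-4,
                  min_lr: float = 5e-6) -> float:
    """Warm up linearly to max_lr over warmup_steps, then decay to min_lr.
    Warm-up length and the two learning rates follow the paper; the linear
    decay shape is an assumption."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * min(1.0, progress)

# Example with batch size 16 and 10 epochs (paper values); the training-set size
# of 4,866 is taken from the split row above.
steps_per_epoch = 4866 // 16
total_steps = 10 * steps_per_epoch
print(learning_rate(0, total_steps), learning_rate(1_000, total_steps), learning_rate(total_steps, total_steps))
```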