Proving Theorems Recursively

Authors: Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, Xiaodan Liang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on the miniF2F and PISA datasets, and significant performance gains are observed for our POETRY approach over state-of-the-art methods. On miniF2F, POETRY achieves an average improvement of 5.1% in proving success rate.
Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, 2 Huawei Noah's Ark Lab, 3 University of Edinburgh, 4 CityU, 5 HKU, 6 HKUST (Guangzhou), 7 HKUST, 8 Pengcheng Laboratory
Pseudocode | Yes | Algorithm 1: Core data curation process
Open Source Code | Yes | By directly releasing the code, model, and data, we aim to ensure the responsible use of our work, fostering further innovation and maintaining high standards of data privacy and intellectual property compliance. https://github.com/wiio12/POETRY
Open Datasets | Yes | We conduct extensive experiments on the theorem proving datasets miniF2F [Zheng et al., 2021] and PISA [Jiang et al., 2021] to validate the effectiveness of our proposed approach.
Dataset Splits | Yes | The problems are divided into valid and test sets, with 244 problems each. POETRY significantly outperforms previous approaches, achieving a pass rate of 42.2% on both the miniF2F valid and test datasets. Table 3: Dataset statistics for the newly extracted PISA dataset based on Isabelle 2022 (columns: train, valid, test single-level, test multi-level).
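As an illustration of how the reported pass rate maps onto the 244-problem splits, here is a minimal Python sketch. It is not from the POETRY codebase; the dictionary and helper names are assumptions, and only the split sizes and the 42.2% figure come from the report above.

```python
# Minimal sketch (illustrative names): miniF2F split sizes and a pass-rate helper.
MINIF2F_SPLITS = {
    "valid": 244,  # number of problems in the miniF2F valid split
    "test": 244,   # number of problems in the miniF2F test split
}

def pass_rate(proved: int, split: str) -> float:
    """Fraction of problems in `split` closed with a verified proof."""
    return proved / MINIF2F_SPLITS[split]

# Example: 103 proved problems out of 244 corresponds to roughly the reported 42.2%.
print(f"{pass_rate(103, 'test'):.1%}")  # -> 42.2%
```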
Hardware Specification | Yes | We use an Nvidia A800 GPU with 80GB of GPU memory for fine-tuning. The training server has 104 CPU cores and 1024GB of CPU memory.
Software Dependencies | Yes | We choose Isabelle [Paulson, 1994] as our formal environment. As shown in Table 3, the newly constructed PISA dataset contains 3.02 million proof steps in the training data, whereas the old PISA dataset extracted by LISA [Jiang et al., 2021] contains only 2.49 million proof steps. Another notable aspect of the dataset statistics is the two subsets of the PISA test set: the single-level subset contains 2/3 of the problems in the test set but only 14% of the proof steps, while the multi-level subset contains the remaining 86% of proof steps.
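The single-level versus multi-level breakdown can be checked by counting problems and proof steps per subset. The sketch below assumes a hypothetical JSONL dump in which each record carries a `subset` label and a `proof_steps` list; the file name and field names are illustrative and not part of the released PISA data format.

```python
import json
from collections import defaultdict

def subset_statistics(path: str) -> dict:
    """Count problems and proof steps per test subset ('single-level' / 'multi-level')."""
    problems = defaultdict(int)
    steps = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            problems[record["subset"]] += 1
            steps[record["subset"]] += len(record["proof_steps"])

    total_problems = sum(problems.values())
    total_steps = sum(steps.values())
    return {
        name: {
            "problem_share": problems[name] / total_problems,
            "proof_step_share": steps[name] / total_steps,
        }
        for name in problems
    }

# Expected (per the paper): single-level holds ~2/3 of the problems but only ~14% of
# the proof steps; multi-level holds the remaining ~86% of proof steps.
print(subset_statistics("pisa_test.jsonl"))
```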
Experiment Setup | Yes | During fine-tuning, we use a global batch size of 256 with 3,500 warmup steps and the AdamW optimizer. We use a cosine learning-rate schedule with a maximum learning rate of 3e-4 and a minimum learning rate of 3e-5. The model is fine-tuned with a training budget of 100,000 steps, and inference uses the checkpoint with the lowest validation loss, selected via early stopping. For the recursive best-first search configuration, we use a global timeout of 600 seconds; each proof step has a timeout limit of 10 seconds. The number of samples per expansion e is set to 32, and we use beam-search decoding to sample proof steps from the language model. The maximum number of expansion steps is set to 128, and the maximum recursion depth for searching deeper-level proofs is set to 10. For proof searches beyond the first level, a local timeout of 120 seconds is also applied.
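To make the reported hyperparameters easy to scan, the sketch below collects them in a single Python configuration object. The class and field names are assumptions for illustration; only the numeric values and the optimizer/decoding choices come from the section above.

```python
from dataclasses import dataclass

@dataclass
class PoetryExperimentConfig:
    # Fine-tuning (values as reported above)
    global_batch_size: int = 256
    warmup_steps: int = 3_500
    optimizer: str = "AdamW"
    lr_schedule: str = "cosine"
    max_learning_rate: float = 3e-4
    min_learning_rate: float = 3e-5
    max_training_steps: int = 100_000   # training budget; checkpoint picked by lowest validation loss

    # Recursive best-first search
    global_timeout_s: int = 600         # wall-clock budget for the whole proof search
    step_timeout_s: int = 10            # per-proof-step execution timeout
    samples_per_expansion: int = 32     # e: candidate proof steps sampled per expansion
    decoding: str = "beam_search"
    max_expansions: int = 128           # maximum number of expansion steps
    max_recursion_depth: int = 10       # maximum depth for deeper-level proof searches
    local_timeout_s: int = 120          # timeout for each proof level below the first

config = PoetryExperimentConfig()
```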