LEGO-Prover: Neural Theorem Proving with Growing Libraries

Authors: Haiming Wang, Huajian Xin, Chuanyang Zheng, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Xiaodan Liang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental LEGO-Prover advances the state-of-the-art pass rate on mini F2F-valid (48.0% to 57.0%) and mini F2F-test (45.5% to 50.0%). During the proving process, LEGO-Prover also generates over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in a 4.9% improvement in success rate.
Researcher Affiliation Collaboration 1Sun Yat-sen University 2Huawei Noah s Ark Lab 3The Chinese University of Hong Kong 4City University of Hong Kong 5MBZUAI 6Dark Matter AI Research
Pseudocode Yes As illustrated in Fig. 2 (a) and Algorithm 2, the prover employs a three-step process to construct the proof. ... Algorithm 3 also illustrates the overall process for the evolver.
Open Source Code Yes Code available at https://github.com/wiio12/LEGO-Prover.
Open Datasets Yes We conduct extensive experiments on the popular theorem-proving dataset mini F2F (Zheng et al., 2021).
Dataset Splits Yes The problems are divided into valid and test sets, with 244 problems each.
Hardware Specification No The paper mentions using "Chat GPT (GPT-3.5)" and "Open AI s text-davinci-ada embedding model," which implies using cloud-based LLM services, but does not specify any particular hardware like GPU models, CPU types, or memory used for running the experiments.
Software Dependencies Yes A combination of gpt-3.5-turbo, gpt-3.5-turbo-0301, gpt-3.5-turbo-0613, gpt-3.5-turbo-16k, and gpt-3.5turbo-16k-0613 is employed, with a model being selected randomly during calls to the Open AI API. ... Practically, Chroma DB serves as our vector store, coupled with the Open AI s text-davinci-ada embedding model. ... For interacting with the Isabelle theorem prover, we employ the PISA environment (Jiang et al., 2021).
Experiment Setup Yes each problem undergoes 100 attempts of proving using Chat GPT (GPT-3.5). ... The temperature is consistently set at T = 0.7 across all procedures. Within the prover, 3-shot examples are leveraged for the decomposer. Regarding the formalizer, the quantity of reference skills nf is set to 6 for the valid set and 4 for the test set, and paired with 2 formalization in-context examples. For the directional transformer, the number of reference problem statements nd is set to 4, supplemented by two directional transformation in-context examples. For the request solver, 3 skills are retrieved for forming the in-context demonstration examples.