Subgoal-based Demonstration Learning for Formal Theorem Proving

Authors: Xueliang Zhao, Wenda Li, Lingpeng Kong

ICML 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
"Our integration of subgoal-based learning has notably increased proof accuracy from 38.9% to 44.1% on the miniF2F benchmark. Furthermore, the adoption of diffusion models for demonstration organization can lead to an additional enhancement in accuracy to 45.5%, or a 5× improvement in sampling efficiency compared to previously established methods."

Researcher Affiliation | Academia
"The University of Hong Kong; University of Edinburgh. Correspondence to: Xueliang Zhao <xlzhao22@connect.hku.hk>."

Pseudocode | Yes
"We provide a detailed description of the subgoal refinement method (see Section 2.1) through Algorithm 1. In the k-th iteration, we construct demonstration examples $\{E_i^{(k)}\}_{i=1}^{N}$ using improved subgoal-based proofs." (A minimal sketch of this refinement loop is given after the table.)

Open Source Code | No
The paper does not provide a direct link to open-source code for the methodology or explicitly state its release.

Open Datasets | Yes
"We evaluate our approach using the miniF2F dataset (Zheng et al., 2021), which comprises 488 formal mathematical problems derived from high-school competitions, expressed in three formal languages: Lean, HOL Light, and Isabelle. ... Given the absence of a training split in the miniF2F dataset, we leverage optimal organizations that yield successful proofs from the miniF2F-valid set to train the diffusion model."

Dataset Splits | Yes
"The dataset is divided into a validation and a test set, each including 244 problems. ... We then partition the corresponding demonstration contexts into a training set and a validation set, comprising 81 and 56 instances respectively. The training of our diffusion models is conducted with a learning rate of 5e-4, a batch size of 16, and throughout 50 epochs. We employ an early stopping strategy on the validation set and report the performance averaged in three repetitive experiments." (A training-loop sketch using these settings is given after the table.)

Hardware Specification | Yes
"The processes of training and inference for the diffusion models are conducted on an NVIDIA RTX 3090 GPU. ... This analysis was carried out on a machine with 64 CPU cores, focusing on the average number of Sledgehammer calls and their execution times for each solved statement."

Software Dependencies | Yes
"Throughout our work, we employ ChatGPT (i.e., the gpt-3.5-turbo-0301 version) as the LLM. ... We adhere to the default configuration of Sledgehammer as provided in Isabelle2021, which encompasses a 120-second timeout and a suite of five automated theorem provers (Z3, CVC4, SPASS, Vampire, E)." (These decoding and prover settings are collected in the configuration sketch after the table.)

Experiment Setup | Yes
"For the creation of the formal sketch, the temperature and max_tokens parameters of gpt-3.5-turbo-0301 are set to 0 and 1024, respectively. For each subgoal proof, we make one formal sketch attempt, as suggested by previous literature (Jiang et al., 2023). In terms of the establishment of the subgoal-based proof, we set the number of refinement iterations to be 15, with the number of demonstration examples, denoted as N, being set to 61. ... The training of our diffusion models is conducted with a learning rate of 5e-4, a batch size of 16, and throughout 50 epochs. We set the number of diffusion steps, represented as T, to 80."

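The "Pseudocode" row above summarizes Algorithm 1 (iterative subgoal refinement). The sketch below shows one way that loop could look; the helpers `propose_subgoal_proof` (an LLM call) and `verify_subgoals` (a formal check in Isabelle) are hypothetical stand-ins, and only the iteration count (15) and pool size N = 61 are taken from the paper.

```python
N_EXAMPLES = 61     # number of demonstration examples N (from the paper)
N_ITERATIONS = 15   # number of refinement iterations (from the paper)

def refine_demonstrations(seed_examples, propose_subgoal_proof, verify_subgoals):
    """Iteratively replace demonstration proofs with verified improvements.

    In iteration k the current pool plays the role of {E_i^(k)}; an example is
    updated only when its refined subgoal-based proof passes formal checking.
    """
    examples = list(seed_examples)  # expected to hold N_EXAMPLES entries
    for k in range(N_ITERATIONS):
        for i, example in enumerate(examples):
            # Ask the LLM for a refined subgoal-based proof, conditioning on
            # the current demonstration pool.
            candidate = propose_subgoal_proof(example, demonstrations=examples)
            # Keep the candidate only if every subgoal is formally verified.
            if verify_subgoals(candidate):
                examples[i] = candidate
    return examples
```
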
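The diffusion-model hyperparameters quoted in the "Dataset Splits" and "Experiment Setup" rows (learning rate 5e-4, batch size 16, 50 epochs, early stopping on the validation split, T = 80, an 81/56 train/validation split) can be wired into a standard PyTorch training loop. This is a minimal sketch under assumptions: the model and dataset objects, the `loss` method, and the patience value are placeholders, not the authors' implementation.

```python
import copy
import torch
from torch.utils.data import DataLoader

LEARNING_RATE = 5e-4
BATCH_SIZE = 16
MAX_EPOCHS = 50
DIFFUSION_STEPS = 80  # T, number of diffusion steps (from the paper)

def train_diffusion_model(model, train_set, valid_set, patience=5):
    # train_set has 81 demonstration contexts, valid_set has 56 (per the paper);
    # `patience` is an assumed value, not reported in the paper.
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(MAX_EPOCHS):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model.loss(batch)  # denoising objective (placeholder method)
            loss.backward()
            optimizer.step()

        # Validation pass; keep the checkpoint with the lowest validation loss.
        model.eval()
        with torch.no_grad():
            val_loss = sum(model.loss(b).item() for b in valid_loader) / len(valid_loader)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping on the validation set

    model.load_state_dict(best_state)
    return model
```
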
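The decoding parameters and Sledgehammer defaults quoted in the "Software Dependencies" and "Experiment Setup" rows map onto a single generation call. The sketch below uses the legacy (pre-1.0) `openai` Python client; the prompt layout and the way the prover options would be forwarded to Isabelle are assumptions, not the authors' code.

```python
import openai

# Sledgehammer defaults reported for Isabelle2021 (how they are passed to the
# Isabelle process is not shown here).
SLEDGEHAMMER_TIMEOUT_S = 120
SLEDGEHAMMER_PROVERS = ("z3", "cvc4", "spass", "vampire", "e")

def generate_formal_sketch(demonstrations, statement):
    """One formal-sketch attempt per subgoal proof, with greedy decoding."""
    prompt = "\n\n".join(demonstrations) + "\n\n" + statement  # placeholder prompt layout
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # deterministic decoding, as reported
        max_tokens=1024,
    )
    return response["choices"][0]["message"]["content"]
```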