REFACTOR: Learning to Extract Theorems from Proofs
Authors: Jin Peng Zhou, Yuhuai Wu, Qiyang Li, Roger Baker Grosse
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the Metamath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems. |
| Researcher Affiliation | Collaboration | Jin Peng Zhou (Cornell University); Yuhuai Wu (xAI); Qiyang Li (University of California, Berkeley); Roger Grosse (University of Toronto, Vector Institute) |
| Pseudocode | Yes | Algorithm 1 Theorem Expansion Algorithm Pseudocode |
| Open Source Code | Yes | Code can be found at https://github.com/jinpz/refactor. |
| Open Datasets | Yes | We applied REFACTOR to create datasets from the main and largest library of Metamath, set.mm. In order to fairly compare prover performance reported from Whalen (2016), we used their version of set.mm, which contains 27220 theorems. |
| Dataset Splits | No | The paper states, 'we split the dataset in a way that the prediction targets, namely, the theorems to be extracted, are non-overlapping for the train, valid and test set.' and mentions subsampling for balancing, but does not provide specific percentages or absolute counts for the training, validation, and test splits, nor does it reference predefined splits with exact ratios. |
| Hardware Specification | Yes | We ran all experiments on one NVIDIA Quadro RTX 6000, with 4-core CPUs. |
| Software Dependencies | No | The paper states, 'All methods were implemented in Pytorch Paszke et al. (2019) and Pytorch Geometric library Fey & Lenssen (2019),' but it does not specify the version numbers for these software libraries. |
| Experiment Setup | Yes | The size of the character embedding was set to 128 and the number of hidden neurons in all the fully connected layers was set to 64. Both K and d are hyperparameters. For model training, we used a learning rate of 1e-4 with Adam optimizer (Kingma & Ba, 2015). |
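
The dataset-splits row above notes that the prediction targets (the theorems to be extracted) are non-overlapping across the train, valid, and test sets, without giving ratios or counts. Below is a minimal sketch of one way such a group-based split could be implemented; the `target_theorem` field name and the 80/10/10 ratios are illustrative assumptions, not taken from the paper.

```python
# Sketch of a split in which no target theorem appears in more than one split.
# Assumes each example is a dict with a 'target_theorem' field (illustrative).
import random

def split_by_target(examples, seed=0, ratios=(0.8, 0.1, 0.1)):
    # Collect the distinct target theorems and shuffle them deterministically.
    targets = sorted({ex["target_theorem"] for ex in examples})
    random.Random(seed).shuffle(targets)

    n = len(targets)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    train_t = set(targets[:n_train])
    valid_t = set(targets[n_train:n_train + n_valid])

    # Assign every example to the split that owns its target theorem.
    splits = {"train": [], "valid": [], "test": []}
    for ex in examples:
        t = ex["target_theorem"]
        key = "train" if t in train_t else "valid" if t in valid_t else "test"
        splits[key].append(ex)
    return splits
```

Splitting over target theorems rather than over individual examples is what guarantees the non-overlap property described in the paper; any subsampling for class balancing would then be applied within each split.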
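
The software-dependencies row points out that PyTorch and PyTorch Geometric are named without version numbers. A small snippet like the following could be used to record the versions actually installed when rerunning the released code; it assumes both packages are available in the environment.

```python
# Print the installed versions of the two libraries named in the paper.
import torch
import torch_geometric

print("torch:", torch.__version__)
print("torch_geometric:", torch_geometric.__version__)
```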
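
The experiment-setup row reports a character embedding size of 128, 64 hidden units in all fully connected layers, and Adam with a learning rate of 1e-4. The hedged PyTorch sketch below wires these numbers together; the vocabulary size, sequence length, layer count, and output head are illustrative assumptions, and the paper's actual model operates on proof graphs via PyTorch Geometric, which this sketch does not reproduce.

```python
# Minimal sketch using the reported hyperparameters; architecture details
# other than the reported sizes are assumptions, not taken from the paper.
import torch
import torch.nn as nn

VOCAB_SIZE = 128   # assumed character vocabulary size (not from the paper)
SEQ_LEN = 256      # assumed fixed character-sequence length (not from the paper)
EMBED_DIM = 128    # character embedding size (reported)
HIDDEN_DIM = 64    # hidden units in the fully connected layers (reported)

model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBED_DIM),          # (batch, SEQ_LEN) -> (batch, SEQ_LEN, 128)
    nn.Flatten(start_dim=1),                      # -> (batch, SEQ_LEN * 128)
    nn.Linear(SEQ_LEN * EMBED_DIM, HIDDEN_DIM),   # fully connected layer with 64 units
    nn.ReLU(),
    nn.Linear(HIDDEN_DIM, 1),                     # illustrative per-example output
)

# Adam optimizer with the reported learning rate of 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```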