REFACTOR: Learning to Extract Theorems from Proofs

Authors: Jin Peng Zhou, Yuhuai Wu, Qiyang Li, Roger Baker Grosse

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the Metamath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems. (A toy sketch of proof refactoring follows this table.)
Researcher Affiliation | Collaboration | Jin Peng Zhou (Cornell University); Yuhuai Wu (xAI); Qiyang Li (University of California, Berkeley); Roger Grosse (University of Toronto, Vector Institute)
Pseudocode | Yes | The paper provides pseudocode as Algorithm 1, "Theorem Expansion Algorithm." (A hedged sketch of an expansion step follows this table.)
Open Source Code | Yes | Code can be found at https://github.com/jinpz/refactor.
Open Datasets | Yes | We applied REFACTOR to create datasets from the main and largest library of Metamath, set.mm. In order to fairly compare prover performance reported from Whalen (2016), we used their version of set.mm, which contains 27220 theorems. (A small counting sanity check follows this table.)
Dataset Splits | No | The paper states, "we split the dataset in a way that the prediction targets, namely, the theorems to be extracted, are non-overlapping for the train, valid and test set," and mentions subsampling for balancing, but it does not provide specific percentages or absolute counts for the training, validation, and test splits, nor does it reference predefined splits with exact ratios. (A sketch of one way to realize such a target-disjoint split follows this table.)
Hardware Specification | Yes | We ran all experiments on one NVIDIA Quadro RTX 6000, with 4-core CPUs.
Software Dependencies | No | The paper states, "All methods were implemented in Pytorch Paszke et al. (2019) and Pytorch Geometric library Fey & Lenssen (2019)," but it does not specify the version numbers for these software libraries. (A snippet for recording the versions used in a reproduction follows this table.)
Experiment Setup | Yes | The size of the character embedding was set to 128 and the number of hidden neurons in all the fully connected layers was set to 64. Both K and d are hyperparameters. For model training, we used a learning rate of 1e-4 with Adam optimizer (Kingma & Ba, 2015). (A minimal configuration sketch with these values follows this table.)
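
For the Research Type row, a toy illustration of the refactoring idea, assuming proofs are represented as trees; the Node class, labels, and example proof below are hypothetical and not the paper's data structures. Replacing every occurrence of an extracted sub-proof with a single reference to a new theorem is what shortens proofs and produces the usage counts reported in the abstract.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        # A proof step: the theorem applied at this step and its sub-proofs.
        label: str
        children: tuple = ()

    def size(node: Node) -> int:
        # Proof length measured as the number of nodes in the proof tree.
        return 1 + sum(size(c) for c in node.children)

    def refactor(node: Node, extracted: Node, new_name: str) -> Node:
        # Replace each subtree equal to the extracted component with a single
        # node that references the newly introduced theorem.
        if node == extracted:
            return Node(new_name)
        return Node(node.label, tuple(refactor(c, extracted, new_name) for c in node.children))

    # Toy proof that derives the same sub-proof twice.
    sub = Node("lemma_step", (Node("hyp_a"), Node("hyp_b")))
    proof = Node("combine", (sub, Node("other_step", (sub,))))

    shorter = refactor(proof, sub, "new_thm")
    print(size(proof), "->", size(shorter))  # 8 -> 4: the refactored proof is shorter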
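
For the Pseudocode row: the paper's Algorithm 1 is a theorem expansion procedure. Below is a minimal, self-contained sketch of what one expansion step could look like; the Step class, the hyp0/hyp1 placeholder convention, and the example theorem names are assumptions for illustration, not the paper's actual algorithm. Expanding an application of a theorem splices that theorem's own proof into the caller's proof, with the caller's sub-proofs substituted for the theorem's hypotheses; the expanded proof is the input from which the theorem can later be re-extracted.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Step:
        theorem: str          # name of the theorem applied at this step
        children: tuple = ()  # proofs of that theorem's hypotheses

    def splice(body: Step, args: tuple) -> Step:
        # Substitute hypothesis placeholders hyp0, hyp1, ... with the sub-proofs
        # supplied at the application site.
        if body.theorem.startswith("hyp") and body.theorem[3:].isdigit():
            return args[int(body.theorem[3:])]
        return Step(body.theorem, tuple(splice(c, args) for c in body.children))

    def expand(node: Step, name: str, body: Step) -> Step:
        # Replace every application of theorem `name` with that theorem's proof,
        # yielding the longer proof from which the theorem can be re-extracted.
        children = tuple(expand(c, name, body) for c in node.children)
        if node.theorem == name:
            return splice(body, children)
        return Step(node.theorem, children)

    # A theorem `double` proved by applying `add` to its single hypothesis twice.
    double_body = Step("add", (Step("hyp0"), Step("hyp0")))
    proof = Step("eq_step", (Step("double", (Step("assumption"),)),))
    print(expand(proof, "double", double_body))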
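
For the Open Datasets row, a small sanity check one might run on a local copy of set.mm; the path, and the exact count obtained, depend on the revision (the paper uses the Whalen (2016) version with 27220 theorems). In the Metamath language, provable assertions (theorems) are introduced by the $p keyword, so counting $p tokens approximates the theorem count; tokens inside comments are not excluded here, so treat the result as a rough check only.

    def count_provable_statements(path: str = "set.mm") -> int:
        # Metamath tokens are whitespace-separated; "$p" begins a provable assertion.
        with open(path, encoding="utf-8") as f:
            tokens = f.read().split()
        return tokens.count("$p")

    if __name__ == "__main__":
        print("provable statements ($p):", count_provable_statements())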
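
For the Dataset Splits row, one way to realize the split the paper describes is to group examples by their prediction target (the theorem to be extracted) and assign whole groups to train/valid/test, so targets never overlap across splits. The 80/10/10 ratio and the target_theorem field name below are assumptions; the paper does not report exact ratios.

    import random
    from collections import defaultdict

    def split_by_target(examples, ratios=(0.8, 0.1, 0.1), seed=0):
        # Group examples by the theorem they are labeled to extract, then split
        # the groups so no target theorem appears in more than one split.
        groups = defaultdict(list)
        for ex in examples:
            groups[ex["target_theorem"]].append(ex)
        targets = sorted(groups)
        random.Random(seed).shuffle(targets)
        n_train = int(ratios[0] * len(targets))
        n_valid = int(ratios[1] * len(targets))

        def pick(names):
            return [ex for t in names for ex in groups[t]]

        return (pick(targets[:n_train]),
                pick(targets[n_train:n_train + n_valid]),
                pick(targets[n_train + n_valid:]))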
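
For the Software Dependencies row: since the paper does not pin library versions, a reproduction should at least record the versions it actually runs, for example at the start of each experiment.

    import torch
    import torch_geometric

    print("torch:", torch.__version__)
    print("torch_geometric:", torch_geometric.__version__)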
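
For the Experiment Setup row, a minimal PyTorch sketch wiring up the reported values: character embeddings of size 128, 64 hidden neurons in the fully connected layers, and Adam with learning rate 1e-4. The vocabulary size, layer arrangement, and output head are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class CharEncoder(nn.Module):
        def __init__(self, vocab_size: int = 128):
            super().__init__()
            self.char_embedding = nn.Embedding(vocab_size, 128)  # character embedding size 128
            self.mlp = nn.Sequential(
                nn.Linear(128, 64), nn.ReLU(),                   # 64 hidden neurons per FC layer
                nn.Linear(64, 1),                                # e.g. a per-node score
            )

        def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
            emb = self.char_embedding(char_ids).mean(dim=1)      # pool over characters
            return self.mlp(emb)

    model = CharEncoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # Adam, learning rate 1e-4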