NaturalProver: Grounded Mathematical Proof Generation with Language Models
Authors: Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, Yejin Choi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study large-scale language models on two new generation tasks: suggesting the next step in a mathematical proof, and full proof generation. We develop NATURALPROVER, a language model that generates proofs by conditioning on background references (e.g. theorems and definitions that are either retrieved or human-provided), and optionally enforces their presence with constrained decoding. On theorems from the NATURALPROOFS benchmark, NATURALPROVER improves the quality of next-step suggestions and generated proofs over fine-tuned GPT-3, according to human evaluations from university-level mathematics students. NATURALPROVER is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time, which is to our knowledge the first demonstration of these capabilities using neural language models. |
| Researcher Affiliation | Collaboration | Sean Welleck (1,2), Jiacheng Liu (1), Ximing Lu (2), Hannaneh Hajishirzi (1,2), Yejin Choi (1,2). Affiliations: (1) Paul G. Allen School of Computer Science & Engineering, University of Washington; (2) Allen Institute for Artificial Intelligence. Equal contribution noted. |
| Pseudocode | No | The paper describes its algorithms (e.g., stepwise beam search) only in prose, without structured pseudocode or clearly labeled algorithm blocks; a hedged sketch of such a stepwise search is given below the table. |
| Open Source Code | Yes | Code and data available at https://github.com/wellecks/naturalprover. |
| Open Datasets | Yes | We create a NATURALPROOFS-GEN dataset adapted from NATURALPROOFS [45]... [45] split the examples and reference sets into training, dev, and test splits... We adopt these splits of roughly 12.5k training, 1k validation, and 1k test examples... |
| Dataset Splits | Yes | We adopt these splits of roughly 12.5k training, 1k validation, and 1k test examples (see the split-check sketch after the table). |
| Hardware Specification | Yes | All models are fine-tuned on OpenAI’s GPT-3 (Curie) using their API. They are trained for 4 epochs with a learning rate of 1e-5 on 4 NVIDIA A100s, with a batch size of 256 for approximately 10 hours for GPT-3 fine-tuning, and for 20 epochs for reference reconstruction. (A hedged API sketch follows the table.) |
| Software Dependencies | No | The paper mentions fine-tuning on 'OpenAI’s GPT-3 (Curie) using their API' but does not provide specific version numbers for software dependencies or libraries such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All models are fine-tuned on OpenAI’s GPT-3 (Curie) using their API. They are trained for 4 epochs with a learning rate of 1e-5 on 4 NVIDIA A100s, with a batch size of 256 for approximately 10 hours for GPT-3 fine-tuning, and for 20 epochs for reference reconstruction. |
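Since the paper gives only a textual description of its stepwise beam search with constrained decoding, here is a minimal illustrative sketch of that style of search. Everything here is an assumption for illustration: the scoring rule, the `sample_next_steps` and `lm_logprob` callables, and the `"QED"` stop marker are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch of stepwise beam search with a reference-constraint
# value function, in the spirit of the paper's textual description.
# sample_next_steps and lm_logprob are hypothetical callables supplied
# by the caller; they are not part of the NaturalProver codebase.

def constraint_score(step, references):
    """Fraction of required references mentioned in this step (assumed scoring rule)."""
    mentioned = sum(1 for ref in references if ref in step)
    return mentioned / max(len(references), 1)

def stepwise_beam_search(theorem, references, sample_next_steps, lm_logprob,
                         beam_size=5, samples_per_beam=10, max_steps=6, alpha=1.0):
    beams = [("", 0.0)]  # (proof-so-far, cumulative score)
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            # Sample several candidate next proof steps for this prefix.
            for step in sample_next_steps(theorem, prefix, n=samples_per_beam):
                value = lm_logprob(step, theorem, prefix) \
                        + alpha * constraint_score(step, references)
                candidates.append((prefix + step, score + value))
        if not candidates:
            break
        # Keep the top-scoring proofs-so-far as the new beam.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(proof.rstrip().endswith("QED") for proof, _ in beams):  # assumed stop marker
            break
    return beams[0][0]
```

The design point is that candidate next steps are ranked by a combination of language-model likelihood and how strongly they mention the grounding references, which is the role constrained decoding plays in the paper's description.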
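The hardware and setup rows report fine-tuning GPT-3 (Curie) through OpenAI's API for 4 epochs at batch size 256. A minimal sketch of launching such a job with the legacy fine-tunes endpoint (openai-python before 1.0, current at publication time) follows; the API key and file ID are placeholders, and note that this endpoint exposes a `learning_rate_multiplier` rather than an absolute learning rate, so the quoted 1e-5 does not map onto it directly.

```python
# Hedged sketch: launching a GPT-3 (Curie) fine-tune via the legacy
# OpenAI fine-tunes endpoint (openai-python < 1.0). Placeholder values
# throughout; this is not the authors' training script.
import openai

openai.api_key = "sk-..."  # placeholder

job = openai.FineTune.create(
    training_file="file-XXXXXXXX",  # placeholder: uploaded JSONL of prompt/completion pairs
    model="curie",
    n_epochs=4,      # matches the 4 epochs reported in the paper
    batch_size=256,  # matches the reported batch size
)
print(job["id"])
```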
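To sanity-check the reported ~12.5k/1k/1k split sizes after obtaining the data from the repository linked above, a small script like the following could be used. The file name and JSON schema are assumptions for illustration, not the repository's actual layout.

```python
# Hedged sketch: counting examples per split. "naturalproofs_gen.json"
# and the split keys are hypothetical stand-ins for the released files.
import json

with open("naturalproofs_gen.json") as f:  # hypothetical file name
    data = json.load(f)

for split in ("train", "valid", "test"):
    print(split, len(data.get(split, [])))  # expect roughly 12.5k / 1k / 1k
```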