Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
NaturalProver: Grounded Mathematical Proof Generation with Language Models
Authors: Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, Yejin Choi
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study large-scale language models on two new generation tasks: suggesting the next step in a mathematical proof, and full proof generation. We develop NATURALPROVER, a language model that generates proofs by conditioning on background references (e.g. theorems and deļ¬nitions that are either retrieved or human-provided), and optionally enforces their presence with constrained decoding. On theorems from the NATURALPROOFS benchmark, NATURALPROVER improves the quality of next-step suggestions and generated proofs over ļ¬ne-tuned GPT-3, according to human evaluations from university-level mathematics students. NATURALPROVER is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time, which is to our knowledge the ļ¬rst demonstration of these capabilities using neural language models.1 |
| Researcher Affiliation | Collaboration | Sean Welleck1,2 , Jiacheng Liu1 , Ximing Lu2, Hannaneh Hajishirzi1,2, Yejin Choi1,2 1Paul G. Allen School of Computer Science & Engineering, University of Washington 2Allen Institute for Artiļ¬cial Intelligence, Equal contribution |
| Pseudocode | No | The paper describes algorithms (e.g., stepwise beam search) in textual format but does not present them as structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code and data available at https://github.com/wellecks/naturalprover. |
| Open Datasets | Yes | We create a NATURALPROOFS-GEN dataset adapted from NATURALPROOFS [45]... [45] split the examples and reference sets into training, dev, and test splits... We adopt these splits of roughly 12.5k training, 1k validation, and 1k test examples... |
| Dataset Splits | Yes | We adopt these splits of roughly 12.5k training, 1k validation, and 1k test examples |
| Hardware Specification | Yes | All models are ļ¬ne-tuned on OpenAIās GPT-3 (Curie) using their API. They are trained for 4 epochs with a learning rate of 1e-5 on 4 NVIDIA A100s, with a batch size of 256 for approximately 10 hours for GPT-3 ļ¬ne-tuning, and for 20 epochs for reference reconstruction. |
| Software Dependencies | No | The paper mentions fine-tuning on 'OpenAIās GPT-3 (Curie) using their API' but does not provide specific version numbers for software dependencies or libraries such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All models are ļ¬ne-tuned on OpenAIās GPT-3 (Curie) using their API. They are trained for 4 epochs with a learning rate of 1e-5 on 4 NVIDIA A100s, with a batch size of 256 for approximately 10 hours for GPT-3 ļ¬ne-tuning, and for 20 epochs for reference reconstruction. |