Magnushammer: A Transformer-Based Approach to Premise Selection
Authors: Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Q. Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, Yuhuai Wu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Magnushammer on the PISA and miniF2F theorem-proving benchmarks using proof success rate as a metric. Our main result is that Magnushammer outperforms Sledgehammer by a large margin and, combined with Thor (Jiang et al., 2022a), sets a new state of the art on the PISA benchmark (71.0%, up from 57.0%). Through ablations, we study the effectiveness of Magnushammer and the contribution of its components. |
| Researcher Affiliation | Collaboration | Maciej Mikuła (Google DeepMind); Szymon Tworkowski (xAI); Szymon Antoniak; Bartosz Piotrowski (IDEAS NCBR); Albert Qiaochu Jiang (University of Cambridge); Jin Peng Zhou (Cornell University); Christian Szegedy (xAI); Łukasz Kuciński (IDEAS NCBR, IMPAN); Piotr Miłoś (IDEAS NCBR, IMPAN); Yuhuai Wu (xAI) |
| Pseudocode | Yes | Algorithm 1: Premise selection with Magnushammer. ... Algorithm 2: Magnushammer training. ... Algorithm 3: Magnushammer evaluation in the ITP environment. (A minimal sketch of the two-stage SELECT/RERANK pipeline these algorithms describe follows the table.) |
| Open Source Code | No | The paper states it open-sources a novel dataset for premise selection and links to it (https://huggingface.co/datasets/Simontwice/premise_selection_in_isabelle), but it does not state that the source code for Magnushammer itself is open-sourced, nor does it link to a code repository. |
| Open Datasets | Yes | We created and released a comprehensive dataset of textual representations for Isabelle's proof states and premises. To the best of our knowledge, this is the first high-quality dataset of this kind for Isabelle, and also the largest premise selection dataset overall. ... https://huggingface.co/datasets/Simontwice/premise_selection_in_isabelle ... The backbone is pre-trained with a language modeling task on the GitHub and arXiv subsets of the Pile dataset (Gao et al., 2021). (A loading sketch for this dataset follows the table.) |
| Dataset Splits | Yes | miniF2F consists of 488 high-school competition-level problems, split into validation and test sets of 244 problems each. |
| Hardware Specification | Yes | We gratefully acknowledge that our research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). We use TPU virtual machines from the Google Cloud Platform (GCP) for all stages: pre-training, fine-tuning, and evaluation. Each TPU virtual machine has 8 TPU v3 cores, 96 CPU cores, and over 300GB of RAM. TPU v3 cores have around 16GB of memory each. |
| Software Dependencies | No | The paper mentions 'Isabelle 2021-1' for the Sledgehammer setup and the 'GPT-2 tokenizer' for its models, but does not provide a reproducible list of version numbers for other key software components (e.g., programming language or deep learning framework versions) used in its experiments. |
| Experiment Setup | Yes | We performed the following hyperparameter sweeps. We note that we have not observed significant differences between the obtained results. Learning rate: {1e-4, 2e-4, 3e-4, 5e-4}, chosen: 2e-4. Dropout: {0.0, 0.05, 0.1, 0.2}, chosen: 0.1. Weight decay: {0.02, 0.05, 0.1}, chosen: 0.02. Batch size N in SELECT: {128, 256, 512}, chosen: 256. Number of negatives M in SELECT: {0, 256, 768, 1536}, chosen: 768. Temperature for InfoNCE loss in SELECT: {0.05, 0.07, 0.2, 1}, chosen: 0.07. Batch size for RERANK: {16, 32, 64}, chosen: 64. Number of negatives per proof state M in RERANK: {7, 15}, chosen: 15. (A hedged sketch of this InfoNCE setup follows the table.) |
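
As a companion to Algorithms 1–3, the following is a minimal sketch of the two-stage retrieve-then-rerank pipeline the pseudocode describes. It is not the paper's implementation: `embed` and `pair_score` are hypothetical placeholders standing in for the fine-tuned transformer encoders, introduced only so the control flow runs end to end.

```python
import numpy as np

# Placeholder encoder standing in for Magnushammer's transformer backbone.
# The paper embeds textual proof states and premises with a fine-tuned model;
# here we fabricate unit-norm vectors so the pipeline runs end to end.
def embed(texts):
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    vecs = rng.normal(size=(len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select(proof_state, premises, k=100):
    """SELECT stage: rank all premises by cosine similarity to the proof state."""
    state_vec = embed([proof_state])[0]
    scores = embed(premises) @ state_vec  # cosine similarity of unit-norm rows
    return np.argsort(-scores)[:k].tolist()

def rerank(proof_state, premises, candidates):
    """RERANK stage: jointly re-score each retrieved (proof state, premise) pair.
    The real model scores the concatenated pair with a transformer; this
    placeholder reuses the fake embedding of the concatenation."""
    def pair_score(i):
        return float(embed([proof_state + " <SEP> " + premises[i]])[0, 0])
    return sorted(candidates, key=pair_score, reverse=True)

state = "lemma: x + y = y + x"
premises = ["add_commute: a + b = b + a", "mult_assoc: (a * b) * c = a * (b * c)"]
print(rerank(state, premises, select(state, premises, k=2)))  # indices, best first
```

The design rationale is the standard one for two-stage retrieval: SELECT cheaply scores every premise against the proof state with independent embeddings, and RERANK spends a more expensive joint model only on the retrieved shortlist.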
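The released premise selection dataset can be inspected with the Hugging Face `datasets` library. Only the dataset ID below comes from the paper; the split and column names are not described in this summary, so the sketch just prints whatever schema the hub exposes rather than assuming one.

```python
from datasets import load_dataset

# Dataset ID taken from the paper's footnote. Split and column names are not
# assumed here; we simply inspect whatever schema the hub provides.
ds = load_dataset("Simontwice/premise_selection_in_isabelle")
print(ds)  # available splits and their columns

first_split = next(iter(ds.values()))
print(first_split[0])  # one raw proof-state/premise record
```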
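For the SELECT training hyperparameters above (batch size N = 256, M = 768 extra negatives, temperature 0.07), here is a hedged PyTorch sketch of an InfoNCE objective with in-batch negatives. The function name, tensor layout, and embedding width are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(state_emb, pos_emb, neg_emb, temperature=0.07):
    """InfoNCE over cosine similarities.

    state_emb: (N, d) proof-state embeddings (batch size N, e.g. 256)
    pos_emb:   (N, d) matching premise embeddings; row i is the positive for state i
    neg_emb:   (M, d) extra negative premise embeddings (e.g. M = 768)
    Other in-batch positives also act as negatives for each proof state.
    """
    state_emb = F.normalize(state_emb, dim=-1)
    premise_emb = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)
    logits = state_emb @ premise_emb.T / temperature        # (N, N + M)
    targets = torch.arange(state_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy shapes matching the chosen hyperparameters (d is arbitrary here).
N, M, d = 256, 768, 512
loss = info_nce_loss(torch.randn(N, d), torch.randn(N, d), torch.randn(M, d))
print(loss.item())
```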