Magnushammer: A Transformer-Based Approach to Premise Selection

Authors: Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Q. Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, Yuhuai Wu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Magnushammer on the PISA and miniF2F theorem-proving benchmarks using proof success rate as the metric. Our main result is that Magnushammer outperforms Sledgehammer by a large margin and, combined with Thor (Jiang et al., 2022a), sets a new state of the art on the PISA benchmark (71.0%, up from 57.0%). Through ablations, we study the effectiveness of Magnushammer and the contribution of its components.
Researcher Affiliation | Collaboration | Maciej Mikuła (Google DeepMind); Szymon Tworkowski (xAI); Szymon Antoniak; Bartosz Piotrowski (IDEAS NCBR); Albert Qiaochu Jiang (University of Cambridge); Jin Peng Zhou (Cornell University); Christian Szegedy (xAI); Łukasz Kuciński (IDEAS NCBR, IMPAN); Piotr Miłoś (IDEAS NCBR, IMPAN); Yuhuai Wu (xAI)
Pseudocode | Yes | Algorithm 1: Premise selection with Magnushammer. ... Algorithm 2: Magnushammer training. ... Algorithm 3: Magnushammer evaluation in the ITP environment. (A minimal Python sketch of this selection loop is given after the table.)
Open Source Code | No | The paper states that it open-sources a novel dataset for premise selection and provides dataset links (e.g., https://huggingface.co/datasets/Simontwice/premise_selection_in_isabelle), but it does not explicitly state that the source code for the Magnushammer method itself is open-sourced, nor does it link to a code repository.
Open Datasets | Yes | We created and released¹ a comprehensive dataset of textual representations for Isabelle's proof states and premises. To the best of our knowledge, this is the first high-quality dataset of this kind for Isabelle, and also the largest premise selection dataset overall. ... ¹https://huggingface.co/datasets/Simontwice/premise_selection_in_isabelle ... The backbone is pre-trained with a language modeling task on the GitHub and arXiv subsets of the Pile dataset (Gao et al., 2021). (A short loading snippet appears after the table.)
Dataset Splits | Yes | miniF2F consists of 488 high-school competition-level problems, split into a validation set and a test set, each with 244 problems.
Hardware Specification | Yes | We gratefully acknowledge that our research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). We use TPU virtual machines from the Google Cloud Platform (GCP) for all stages: pre-training, fine-tuning, and evaluation. Each TPU virtual machine has 8 TPU v3 cores, 96 CPU cores, and over 300 GB of RAM. TPU v3 cores have around 16 GB of memory each.
Software Dependencies | No | The paper mentions 'Isabelle 2021-1' for the Sledgehammer setup and the 'GPT-2 tokenizer' for its models, but it does not provide a reproducible list of version numbers for other key software components (e.g., the programming language or deep-learning framework versions) used in the experiments.
Experiment Setup | Yes | We performed the following hyperparameter sweeps; we did not observe significant differences between the obtained results. Learning rate: {1e-4, 2e-4, 3e-4, 5e-4}, chosen: 2e-4. Dropout: {0.0, 0.05, 0.1, 0.2}, chosen: 0.1. Weight decay: {0.02, 0.05, 0.1}, chosen: 0.02. Batch size N in SELECT: {128, 256, 512}, chosen: 256. Number of negatives M in SELECT: {0, 256, 768, 1536}, chosen: 768. Temperature for the InfoNCE loss in SELECT: {0.05, 0.07, 0.2, 1}, chosen: 0.07. Batch size for RERANK: {16, 32, 64}, chosen: 64. Number of negatives per proof state M in RERANK: {7, 15}, chosen: 15. (An illustrative InfoNCE sketch using the chosen values appears below.)
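The retrieve-then-rerank loop referenced in the Pseudocode row (Algorithms 1-3) can be sketched as follows. This is a minimal illustration assuming a bi-encoder SELECT stage (cosine similarity between embeddings) followed by a pairwise RERANK scorer; the `embed` and `pair_score` callables stand in for the paper's transformer models, and the shortlist sizes are placeholders rather than the authors' settings.

```python
import numpy as np

def select(state_emb: np.ndarray, premise_embs: np.ndarray, k: int) -> np.ndarray:
    """SELECT stage: return indices of the k premises whose normalized
    embeddings have the highest cosine similarity to the proof-state embedding."""
    s = state_emb / np.linalg.norm(state_emb)
    p = premise_embs / np.linalg.norm(premise_embs, axis=1, keepdims=True)
    return np.argsort(-(p @ s))[:k]

def rerank(pair_score, proof_state: str, premises: list, shortlist) -> list:
    """RERANK stage: jointly score each (proof state, premise) pair in the
    shortlist and sort the shortlist by that score, best first."""
    return sorted(shortlist,
                  key=lambda i: pair_score(proof_state, premises[i]),
                  reverse=True)

def premise_selection(embed, pair_score, proof_state, premises,
                      k_select=1024, k_final=32):
    """End-to-end premise selection: embed everything, retrieve a shortlist with
    SELECT, reorder it with RERANK, and keep the top k_final premises. In the
    evaluation loop (Algorithm 3), the chosen premises would then be handed to
    the prover, e.g. as facts for Isabelle tactics."""
    state_emb = embed(proof_state)
    premise_embs = np.stack([embed(p) for p in premises])
    shortlist = select(state_emb, premise_embs, k_select)
    ranked = rerank(pair_score, proof_state, premises, shortlist)
    return [premises[i] for i in ranked[:k_final]]
```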
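For the released premise-selection dataset, the standard Hugging Face `datasets` loader should suffice. The snippet below is a minimal sketch; the split and column names it prints are whatever the dataset card defines and are not specified in the excerpts above.

```python
from datasets import load_dataset

# Load the premise-selection dataset released with the paper.
ds = load_dataset("Simontwice/premise_selection_in_isabelle")

print(ds)  # shows the available splits and their column names
first_split = next(iter(ds.values()))
print(first_split[0])  # one example, e.g. a proof-state / premise text pair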
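The SELECT objective referenced in the Experiment Setup row is an InfoNCE contrastive loss. The sketch below shows one common formulation with the chosen temperature of 0.07, using in-batch premises as negatives plus an optional block of extra negatives; tensor shapes, normalization, and the batching of the extra negatives are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(state_embs, premise_embs, extra_neg_embs=None, temperature=0.07):
    """state_embs: (N, d) proof-state embeddings; premise_embs: (N, d) matching
    premises (row i is the positive for state i); extra_neg_embs: (M, d) optional
    additional negative premises shared across the batch."""
    state_embs = F.normalize(state_embs, dim=-1)
    premise_embs = F.normalize(premise_embs, dim=-1)
    logits = state_embs @ premise_embs.T              # (N, N); diagonal = positives
    if extra_neg_embs is not None:
        extra_neg_embs = F.normalize(extra_neg_embs, dim=-1)
        logits = torch.cat([logits, state_embs @ extra_neg_embs.T], dim=1)  # (N, N+M)
    targets = torch.arange(state_embs.size(0), device=logits.device)
    return F.cross_entropy(logits / temperature, targets)

# Example with the chosen hyperparameters: batch size N = 256, M = 768 extra negatives.
d = 512  # embedding width is a placeholder
loss = info_nce_loss(torch.randn(256, d), torch.randn(256, d), torch.randn(768, d))
print(loss.item())
```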