BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving

Authors: Sean Lamont, Michael Norrish, Amir Dezfouli, Christian Walder, Paul Montague

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate BAIT's capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art architectures applied to the problem of formula embedding. We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide a qualitative analysis, illustrating that improved performance is associated with more semantically-aware embeddings.
Researcher Affiliation | Collaboration | Sean Lamont (1, 2), Michael Norrish (1), Amir Dezfouli (3), Christian Walder (4), Paul Montague (2); affiliations: 1 Australian National University, 2 Defence Science and Technology Group, 3 BIMLOGIQ, 4 Google DeepMind
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Being fully open source, the addition of new benchmarks and algorithms is facilitated by a modular and decoupled design. Repository: https://github.com/sean-lamont/bait
Open Datasets | Yes | The benchmarks we use for supervised experiments are HOLStep (Kaliszyk, Chollet, and Szegedy 2017), MIZAR40 (Kaliszyk and Urban 2015), LeanStep (Han et al. 2022) and HOList (Bansal et al. 2019a). We also include a new HOL4 premise selection task, which we generate using theorem dependencies from the HOL4 standard library.
Dataset Splits | Yes | We follow their protocol of a random 80:20 training and validation split, with 948 training goals and 237 validation goals. The best scoring checkpoint on the validation set was then assessed on the test set. (A minimal sketch of such a split is given below the table.)
Hardware Specification | No | The paper states: "Due to computational constraints, we were unable to extensively evaluate HOList models on the end-to-end prover, with previous approaches using, for example, 2000 CPUs (Bansal et al. 2019b)." This describes hardware used by previous approaches, not the authors' own experiments; no specific GPU or CPU models, or other machine specifications, are given for their runs.
Software Dependencies | No | The paper mentions software tools such as Weights and Biases and Hydra, but does not provide version numbers for these or for other key software dependencies required for replication.
Experiment Setup | Yes | We perform a hyperparameter sweep over configurations for all architectures. We set a maximum sequence length of 1024. For HOList (Bansal et al. 2019a), we follow the exact supervised training setup described in (Paliwal et al. 2020). The full set of hyperparameters is in the supplementary material, with further details on the datasets and model architectures. (A generic sweep sketch is given below the table.)
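
The Dataset Splits row reports a random 80:20 split yielding 948 training and 237 validation goals. The sketch below illustrates one way such a split could be produced; the goal list, seed, and function names are illustrative assumptions and are not taken from the BAIT codebase.

```python
# Minimal sketch of an 80:20 random goal split (illustrative, not BAIT's code).
import random

def split_goals(goals, train_frac=0.8, seed=0):
    """Randomly partition a list of goals into train and validation sets."""
    rng = random.Random(seed)
    shuffled = goals[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# With 1185 goals, an 80:20 split gives the reported 948 train / 237 validation.
goals = [f"goal_{i}" for i in range(1185)]
train, val = split_goals(goals)
assert len(train) == 948 and len(val) == 237
```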
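
The Experiment Setup row mentions a hyperparameter sweep over configurations for all architectures, with a maximum sequence length of 1024 and the full hyperparameter set deferred to the supplementary material. The following is a generic grid-sweep sketch under those constraints; the hyperparameter names and value ranges are assumptions for illustration, not the paper's actual sweep.

```python
# Generic grid-style hyperparameter sweep sketch (hyperparameter names and
# ranges are assumed; only the 1024 sequence-length cap comes from the paper).
from itertools import product

MAX_SEQ_LEN = 1024  # stated in the paper

sweep_space = {
    "embedding_dim": [128, 256],     # assumed values
    "learning_rate": [1e-3, 1e-4],   # assumed values
    "num_layers": [2, 4],            # assumed values
}

def sweep_configs(space):
    """Yield one configuration dict per point in the Cartesian product."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cfg["max_seq_len"] = MAX_SEQ_LEN
        yield cfg

for cfg in sweep_configs(sweep_space):
    print(cfg)  # in practice, each configuration would be trained and validated
```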