Efficient Semantic Features for Automated Reasoning over Large Theories
Authors: Cezary Kaliszyk, Josef Urban, Jiří Vyskočil
IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we (i) propose novel semantic features characterizing the statements in such large semantic knowledge bases, (ii) propose and carry out their efficient implementation using deductive-AI datastructures such as substitution trees and discrimination nets, and (iii) show that they significantly improve the strength of existing knowledge selection methods and automated reasoning methods over the large formal knowledge bases. In particular, on a standard large-theory benchmark we improve the average predicted rank of a mathematical statement needed for a proof by 22% in comparison with state of the art. This allows us to prove 8% more theorems in comparison with state of the art. (A simplified feature-extraction sketch follows the table.) |
| Researcher Affiliation | Academia | Cezary Kaliszyk, University of Innsbruck; Josef Urban, Radboud University; Jiří Vyskočil, Czech Technical University in Prague |
| Pseudocode | No | The paper describes algorithmic approaches verbally but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | Here we rely on the MPTP2078 benchmark [Alama et al., 2014] used in the 2012 CASC@Turing ATP competition [Sutcliffe, 2013] of the Alan Turing Centenary Conference. Examples of such corpora are the Mizar Mathematical Library (MML), the Isabelle Archive of Formal Proofs (AFP), and the Flyspeck (Formal Proof of the Kepler Conjecture) development done in HOL Light. |
| Dataset Splits | Yes | The formulas and problems have a natural linear ordering derived from their (chronological) order of appearance in the Mizar library. As usual in such large-theory evaluations [Kaliszyk and Urban, 2014], we emulate the scenario of using AI/ATP assistance in interactive theorem proving: For each conjecture C we assume that all formulas stated earlier in the development can be used to prove C. This scenario results in large ATP problems that have 1877 axioms on average. Such problems are typically difficult to solve without techniques for pre-selection of the most relevant axioms (cf. Table 4). For each conjecture C we also use its ITP (human-written) proof to extract only the premises needed for the ITP proof of C, i.e., P(C). This proof information is used in three ways: (i) to train the machine learners on all previous proofs for each conjecture C, (ii) to compare the premises predicted by such trained machine learners with the actual proof premises P(C), and (iii) to construct the small ATP problem for C, which (unlike the large version) contains as axioms only the (few) premises P(C) and in this way, the ATP is very significantly advised by the human author of the ITP proof. (A minimal sketch of this chronological protocol follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. It only mentions time limits for ATP runs. |
| Software Dependencies | Yes | We limit the evaluation to two fast learning methods: distance-weighted k-NN and naive Bayes (...) Table 4 (Unaided ATPs on the large and small problems) lists the ATPs used: E 1.8, Vampire 2.6, and Z3. |
| Experiment Setup | Yes | Assuming a method F for extracting mathematically relevant features characterizing the theorems, this setting leads to the following multi-label learning task: Each proved theorem c ∈ Γ produces a training example consisting of (F(c), P(c)), i.e., given the features F(c) we want the trained predictor to recommend the labels P(c). The learning methods mentioned above are typically used as rankers: given the features of a new conjecture c, the highest-ranked (using heuristically determined thresholds) labels for F(c) are given to an ATP, which then attempts a proof. To maximize the ATP performance, usually a portfolio of several most complementary learning methods, feature characterizations and ranking thresholds is used, and the (typically exponentially behaving) ATP is run in a strategy-scheduling mode [Tammet, 1997], i.e., with several shorter time limits over such complementary predictions rather than using the whole time limit for the most promising method. For each of the best new and old premise-selection methods we create six versions of ATP problems, by taking the top-ranking segments of 8, 16, 32, 64, 128, and 256 predicted premises, and we run E on such problems for 10 seconds. Each problem thus takes again at most 60 seconds altogether, allowing us to compare the results also with the 60-second unaided ATP runs on the large and small problems. (A k-NN ranking and top-k slicing sketch follows the table.) |
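
The "Research Type" row quotes the paper's semantic features implemented with substitution trees and discrimination nets. The paper's strongest features (matching, generalization, and unification neighbourhoods) require such term indexing; the sketch below shows only the simpler variable-abstracted subterm features, and the `Term` representation and function names are illustrative assumptions, not the authors' code.

```python
# Simplified sketch of subterm feature extraction for first-order terms.
# Hypothetical representation: NOT the authors' implementation, which uses
# substitution trees / discrimination nets for efficiency.
from dataclasses import dataclass
from typing import Set, Tuple

@dataclass(frozen=True)
class Term:
    head: str                       # function/predicate symbol or variable name
    args: Tuple["Term", ...] = ()   # sub-terms (empty for constants/variables)
    is_var: bool = False

def subterm_features(t: Term) -> Set[str]:
    """Return one string feature per subterm, with every variable
    abstracted to the placeholder 'V' (variable-normalized features)."""
    feats: Set[str] = set()

    def render(u: Term) -> str:
        if u.is_var:
            return "V"
        if not u.args:
            return u.head
        return u.head + "(" + ",".join(render(a) for a in u.args) + ")"

    def walk(u: Term) -> None:
        feats.add(render(u))
        for a in u.args:
            walk(a)

    walk(t)
    return feats

# Example: plus(X, zero) yields {'plus(V,zero)', 'V', 'zero'}
print(subterm_features(Term("plus", (Term("X", is_var=True), Term("zero")))))
```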
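
The "Dataset Splits" row describes a chronological protocol: the learner for each conjecture C may train only on proofs of statements appearing earlier in the Mizar library order. A minimal emulation sketch, assuming a hypothetical list of `(name, features, premises)` triples already sorted in library order:

```python
# Hedged sketch of the chronological train/evaluate loop from the
# "Dataset Splits" row. 'facts' is assumed to be a list of
# (name, feature_set, proof_premises) triples in library (chronological) order.

def chronological_tasks(facts):
    """Yield (training_examples, query) pairs: the predictor for conjecture
    C is trained only on the ITP proofs of earlier statements."""
    history = []
    for name, features, premises in facts:
        yield list(history), (name, features)   # rank premises for C here
        history.append((features, premises))    # C's proof becomes training data
```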
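
The "Experiment Setup" row frames premise selection as multi-label ranking followed by top-k slicing. Below is a minimal, self-contained sketch of a distance-weighted k-NN ranker and the six top-ranking segments (8 to 256 premises) fed to the ATP. Jaccard similarity over feature sets and k=40 are illustrative assumptions; the paper uses weighted feature distances rather than this exact configuration.

```python
# Minimal sketch of distance-weighted k-NN premise ranking and the
# 8..256 top-ranking problem slices. Similarity measure and k are
# illustrative choices, not the authors' exact setup.
from collections import defaultdict

def knn_rank(query_feats, training, k=40):
    """training: list of (feature_set, proof_premises) pairs.
    Returns premises ordered by similarity-weighted neighbour votes."""
    def sim(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    neighbours = sorted(training, key=lambda ex: -sim(query_feats, ex[0]))[:k]
    score = defaultdict(float)
    for feats, premises in neighbours:
        w = sim(query_feats, feats)      # closer proofs vote with more weight
        for p in premises:
            score[p] += w
    return sorted(score, key=score.get, reverse=True)

def problem_slices(ranked, sizes=(8, 16, 32, 64, 128, 256)):
    """The six top-ranking segments turned into ATP problems (10 s each)."""
    return {n: ranked[:n] for n in sizes}
```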