Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Learning Interestingness in Automated Mathematical Theory Formation

Authors: George Tsoukalas, Rahul Saha, Amitayush Thakur, Sabrina Reguyal, Swarat Chaudhuri

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we present empirical results evaluating the effectiveness of our approach. We aim to answer key questions about the ability of Evo Abstract to learn effective interestingness measures and the capability of FERMAT, guided by these measures, to generate meaningful mathematical theories. Environment Configuration. FERMAT centrally supports exploration in elementary number theory and finite fields, as these areas are extremely rich while easily represented. The number theory environment is supported by the Z3 Theorem Prover, while finite field reasoning is handled by a custom prover implemented in Python. For number theory, the ground truth benchmark E used for the extrinsic reward function R comprises 180 concepts, conjectures, and theorems sourced from an introductory number theory textbook [2], constituting a set of interesting entities. We similarly curated 67 such ground truth entities over F27, the primary finite-field setting in our experiments. These concepts span a range of theoretical sophistication, from the reflexive properties to the Goldbach conjecture. A detailed description of E along with subsets of the ground truth knowledge graph is detailed in Appendix A.4. For our experiments, we use three different starting configurations: (i) succ_zero_eq The definitions of zero, successor function, and the equality predicate with arity 2; (ii) arithmetic_base Containing zero, one, two, addition, multiplication, divides, , and the Figure 3: A plot of the best program found per iteration for Fun Search and Evo Abstract, shown for the three different starting knowledge graphs, averaged over four runs. As can be seen, the Evo Abstract and Fun Search methods dominate performance universally across all baselines. Evaluation Metrics. To evaluate an interestingness measure, we instantiate the scoring function as extrinsic reward obtained through episodic rollouts of a policy depending on the measure through FERMAT. We run 64 episodes with a timeout of 60 seconds*, and average the reward. This evaluation metric measures the ability of the given policy to reconstruct the curriculum of human-made ground truth mathematical entities E. We also provide qualitative analysis of the learned interestingness measures and the content of the generated theories. Baselines. We compare Evo Abstract against the following baseline methods for generating or selecting interestingness measures:
Researcher Affiliation	Academia	George Tsoukalas UT Austin EMAIL Rahul Saha UT Austin EMAIL Amitayush Thakur UT Austin EMAIL Sabrina Reguyal Princeton University, Stanford University EMAIL Swarat Chaudhuri UT Austin EMAIL
Pseudocode	Yes	The overall Evo Abstract algorithm, detailed in Algorithm 1 (Appendix), thus proceeds in generations. Algorithm 1 Evo Abstract: Synthesis via Evolution and Abstraction Learning Algorithm 2 Policy Template for Action Selection
Open Source Code	Yes	We open-source the FERMAT environment at github.com/trishullab/Fermat.
Open Datasets	No	For number theory, the ground truth benchmark E used for the extrinsic reward function R comprises 180 concepts, conjectures, and theorems sourced from an introductory number theory textbook [2], constituting a set of interesting entities. We similarly curated 67 such ground truth entities over F27, the primary finite-field setting in our experiments.
Dataset Splits	No	The paper describes episodic rollouts for evaluation and does not mention traditional dataset splits like train/test/validation for a fixed dataset. The 'ground truth benchmark E' is a curated set of concepts, not a dataset in the sense of data splits.
Hardware Specification	Yes	Our experiments are run on 64 Intel Xeon Platinum 8352Y and 64 AMD EPYC 7413 24C CPUs.
Software Dependencies	No	The FERMAT framework, implemented in Python, provides the environment for automated theory formation. We instantiate this backend using the Z3 Theorem Prover [11] The compiler is written using the parglare [14] library. We instantiate both the evolution and abstraction samplers Lvar, Labs to use GPT-4o-mini
Experiment Setup	Yes	Evaluation Metrics. To evaluate an interestingness measure, we instantiate the scoring function as extrinsic reward obtained through episodic rollouts of a policy depending on the measure through FERMAT. We run 64 episodes with a timeout of 60 seconds*, and average the reward. Evo Abstract configuration. We configure Evo Abstract to employ k = 4 islands and runs over Ngen = 64 iterations, with each interestingness function being evaluated in 16 i.i.d. rollouts. We perform the abstraction phase every 8 iterations, sampling at most two abstractions per island. When resolving conjectures, we set the timeout to Z3 to be 2.0 seconds, and to 0.5 seconds when using Z3 to add instances to entities without any.