Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Graph-based Symbolic Regression with Invariance and Constraint Encoding

Authors: Ziyu Xiang, Kenna Ashen, Xiaofeng Qian, Xiaoning Qian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on synthetic and real-world scientific datasets demonstrate the efficiency and accuracy of GSR in discovering underlying expressions and adhering to physical laws, offering practical solutions for scientific discovery.
Researcher Affiliation	Academia	1 Texas A&M University, College Station, TX 77840, USA; 2 Brookhaven National Laboratory, Upton, NY 11973, USA; EMAIL; EMAIL
Pseudocode	Yes	Algorithm 1 summarizes the pseudo-code for our GSR. The core of the algorithm is the repeated sampling of expression trajectories τ until either the desired solution is found or the maximum number of iterations is reached. At each trajectory step t, a batch of hn MCTS with index k and a maximum batch size B is executed, following the four steps for hn MCTS iterations in Algorithm 2.
Open Source Code	No	Both the code and data will be released in an open-access repository after the paper is publicly available.
Open Datasets	Yes	Benchmarking Datasets: we have evaluated our GSR on diverse benchmark datasets, including 1) the black-box dataset consisting of 120 tasks without solution expressions from PMLB [25], 2) Feynman dataset [11] consists of 119 physics-informed problems with ground-truth solutions, 3) Nguyen's SR benchmark dataset [26] and Nguyen constant dataset [7].
Dataset Splits	Yes	For both the Black-box and Feynman datasets, we follow the experimental settings of state-of-the-art symbolic regression benchmarking SRBench [27], which includes 21 baseline models for the black-box dataset benchmarking and 14 baseline models for the Feynman dataset benchmarking. Additionally, we compared with the recent transformer-based MCTS method, Transformer-based Planning for Symbolic Regression (TPSR) [10], in the black-box dataset; and a neural-guided GP method, A Unified Framework for Deep Symbolic Regression (uDSR) [9], in the Feynman dataset; as well as a transformer-based GP method, Deep Generative Symbolic Regression (DGSR) [6], for both datasets. We summarize our additional experimental settings in Appendix D.1.
Hardware Specification	Yes	We have performed our experiments on a platform with one CPU, Intel Xeon 6248R (Cascade Lake), 3.0GHz, 24-core, and one GPU, NVIDIA A100 40GB GPU accelerator. For the reported Synthetic Dataset Benchmarking in Section 4 in the main text, our graph-based symbolic regression (GSR) can sample 60 expressions per second on average. For the materials science application in Section 5, GSR on average samples 20 expressions per second. First-principles DFT calculations of the test dataset were performed on a compute node consisting of two Intel Xeon 6248R (Cascade Lake) CPUs with a total of 48 cores and 384GB DDR4 memory.
Software Dependencies	No	The paper mentions VASP [31] for DFT calculations but does not provide a specific version number. No other key software components are mentioned with version numbers.
Experiment Setup	Yes	The maximum complexity H for each expression is set to 50. The batch size B is set to 100 for MCTS simulations, and the maximum number of epochs is capped at 1,000, with a total search space of up to one million expressions for MCTS and GSR. The training and testing datasets are divided equally, with 20 randomly generated data points for each.
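
For readers who want to record the reported Experiment Setup values in a machine-readable form, the following is a minimal sketch, not the authors' code (which the table notes is not yet released). The class name GSRSetup, the helper make_splits, the placeholder target expression, the sampling interval, and the seed are all hypothetical; only the numeric settings (H = 50, B = 100, 1,000 epochs, up to one million expressions, 20 points per split) come from the row above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GSRSetup:
    """Reported settings; field names are hypothetical."""
    max_complexity: int = 50          # H, maximum complexity per expression
    batch_size: int = 100             # B, batch size for MCTS simulations
    max_epochs: int = 1000            # cap on the number of epochs
    max_expressions: int = 1_000_000  # reported upper bound on the search space
    points_per_split: int = 20        # points in each of the train and test sets


def target(x: float) -> float:
    # Placeholder expression (assumption); stands in for a benchmark target.
    return x**3 + x**2 + x


def make_splits(setup: GSRSetup, seed: int = 0, low: float = -1.0, high: float = 1.0):
    """Draw equally sized, randomly generated train and test sets.

    The interval [low, high] and the seed are assumptions for illustration.
    """
    rng = random.Random(seed)

    def sample():
        xs = [rng.uniform(low, high) for _ in range(setup.points_per_split)]
        return [(x, target(x)) for x in xs]

    return sample(), sample()


setup = GSRSetup()
train, test = make_splits(setup)
print(len(train), len(test))
```

Fixing the seed makes the two splits repeatable across runs, which is the kind of detail the Dataset Splits and Experiment Setup variables are meant to capture.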