Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Automated Model Discovery via Multi-modal & Multi-step Pipeline

Authors: Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments 4.1 Gaussian Process Kernel Discovery Datasets and Competing Methods. We evaluate our multi-modal & multi-step pipeline on real-world univariate datasets [33], including Airline Passenger, Solar Irradiance, Mauna Loa, Wheat, Call-Center, Radio, and Gas Production. We refer to the datasets by representative short names for convenience, e.g., Solar for Solar Irradiance. We compare our pipeline against five competing methods, ranging from traditional forecasting methods to the latest LLM-based model discovery approaches: Gaussian Process Regression with a Squared Exponential kernel, ARIMA [52], Facebook Prophet [54], Automatic Statistician [14, 33], and Box LM [30]. We employ GPT-4o-mini for methods using LLMs, including ours. Additional experimental details (e.g., basis kernels and kernel grammars) are given in Appendix A.3. Result. As shown in the quantitative results table, our method discovers better models, achieving consistently lower RMSE than the other methods on average. While Box LM and Gaussian Process (SE) exhibit low RMSE values on the training set, their RMSEs increase significantly in the test region, indicating poor generalization. In contrast, our method maintains consistently low RMSEs across both training and test sets, highlighting its ability to generalize effectively beyond the training data.
Researcher Affiliation	Collaboration	Lee Jung-Mok (1), Nam Hyeon-Woo (1), Moon Ye-Bin (1), Junhyun Nam (2), Tae-Hyun Oh (3). (1) Dept. of Electrical Engineering, POSTECH; (2) Samsung Electronics; (3) School of Computing, KAIST.
Pseudocode	Yes	Algorithm 1 Model Discovery Pipeline. 1: Input: dataset D, rounds R, model pool P. 2: Initialize: best model M*. 3: for r = 1 to R do. 4: M_r = AnalyzerVLM(M*, D) (Proposal). 5: for M ∈ M_r do. 6: θ* = Optimize(M, D) (Fitting). 7: s_M = α · EvaluatorVLM(M, θ*, D) − BIC (Evaluation). 8: end for. 9: P ← P ∪ M_r. 10: M* ← arg max_{M ∈ P} s_M (Selection). 11: end for.
Open Source Code	No	This paper contains the information needed to reproduce the main experimental results in Sec. 4 and the Appendix. We use closed-source models, and we have provided the prompts in the Appendix so that similar results can be reproduced. The data sources used in our experiments are fully accessible and listed in the Appendix, and we provide the prompt for the main experimental results in the Appendix, with instructions for setup and usage.
Open Datasets	Yes	We evaluate our multi-modal & multi-step pipeline on real-world univariate datasets [33], including Airline Passenger, Solar Irradiance, Mauna Loa, Wheat, Call-Center, Radio, and Gas Production. The data sources used in our experiments are fully accessible and listed in the Appendix, and we provide the prompt for the main experimental results in the Appendix, with instructions for setup and usage.
Dataset Splits	Yes	For the Gaussian Process Kernel Discovery experiment, we used the gpss-research dataset, splitting the training data 9:1 to obtain the validation data.
Hardware Specification	Yes	Our experiments are conducted on a 16-core CPU for precise calculation, and they may take multiple hours to run.
Software Dependencies	No	Our experiments are built upon GPy and GPy-ABCD [56].
Experiment Setup	Yes	Our experiments are built upon GPy and GPy-ABCD [56]. We conducted each experiment for 5 rounds, with 10 random restarts, and used L-BFGS-B optimization. We also conduct top-3 sampling from the model pool in each round. We used gpt-4o-mini (for the main result) for both the Analyzer VLM and the Evaluator VLM, and we set the hyperparameter α of our Evaluator VLM to 50 to balance the visual criterion against the BIC. We also include a current-round term in the scoring so that selection favors recent models from the model pool. For Symbolic Regression, we followed [48] and used its dataset. We conducted each experiment for 20 rounds, with 5 random restarts each, and used scipy's optimize.curve_fit for parameter optimization. Function evaluation is done similarly to Gaussian process kernel discovery, with the hyperparameter α set to 0.05. To search the kernel parameters effectively, we first performed 10 random restarts to explore the parameter space broadly. We then substituted the resulting parameters with those proposed by the Analyzer VLM and conducted a second-stage local optimization, using the Analyzer VLM-initialized parameters as starting points. Our experiments are conducted on a 16-core CPU for precise calculation, and they may take multiple hours to run. For the Gaussian Process Kernel Discovery experiment, we used the gpss-research dataset, splitting the training data 9:1 to obtain the validation data. For the Box LM implementation, we followed the explanation in [30]. For a fair comparison with our method, we set the basis kernels to linear (LIN), squared exponential (SE), constant (C), and white noise (WN), excluding the rational quadratic kernel (RQ), and also sampled the top-3 models. Following our pipeline's evaluation, we used the Bayesian Information Criterion for Box LM's top-k model selection.
For the Automatic Statistician experiment, we modified GPy-ABCD to perform greedy search through top-1 selection in each round. For the ARIMA implementation, we set p=2, d=1, q=2; for the Facebook Prophet implementation, we set the seasonality mode to multiplicative and the changepoint prior scale to 0.1.
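
The propose, fit, evaluate, select loop quoted in the Pseudocode row can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the callables `propose`, `fit`, `visual_score`, and `bic` are hypothetical stand-ins for the Analyzer VLM, the optimizer, the Evaluator VLM, and the BIC computation, and combining the two criteria as `alpha * visual_score - bic` (lower BIC is better, higher score is selected) is an assumption about the scoring rule. The toy stubs exist only so the loop runs end to end.

```python
from typing import Callable, Dict, List, Optional, Tuple


def discover_model(
    data,
    rounds: int,
    pool: List[str],
    propose: Callable,       # hypothetical stand-in for the Analyzer VLM
    fit: Callable,           # hypothetical stand-in for Optimize(M, D)
    visual_score: Callable,  # hypothetical stand-in for the Evaluator VLM
    bic: Callable,           # hypothetical BIC function
    alpha: float,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Run the propose -> fit -> evaluate -> select loop for `rounds` rounds."""
    scores: Dict[str, float] = {}
    best: Optional[str] = None
    for _ in range(rounds):
        candidates = propose(best, data)                      # Proposal
        for model in candidates:
            theta = fit(model, data)                          # Fitting
            # Assumed scoring rule: weighted visual score minus BIC.
            scores[model] = alpha * visual_score(model, theta, data) - bic(
                model, theta, data
            )                                                 # Evaluation
        pool.extend(m for m in candidates if m not in pool)   # grow the pool
        scored = [m for m in pool if m in scores]
        best = max(scored, key=scores.get)                    # Selection
    return best, scores


# Toy stubs (illustrative only): models are kernel expressions as strings.
def toy_propose(best, data):
    return ["SE", "LIN"] if best is None else [best + "+PER"]


def toy_fit(model, data):
    return {"n_params": model.count("+") + 1}


def toy_visual(model, theta, data):
    if "PER" in model:
        return 0.9
    return 0.5 if model.startswith("SE") else 0.4


def toy_bic(model, theta, data):
    return 10.0 * theta["n_params"]


best, scores = discover_model(
    data=None, rounds=2, pool=[], propose=toy_propose, fit=toy_fit,
    visual_score=toy_visual, bic=toy_bic, alpha=50.0,
)
print(best, scores)
```

Passing the four components as callables keeps the loop independent of any particular VLM client or Gaussian process library, so real components could be swapped in for the stubs.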
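
Two quantities from the table, the 9:1 training/validation split and the Bayesian Information Criterion, can be written down in a few lines. The sketch below is illustrative only: the table does not say whether the split is chronological or random, so holding out the final tenth of the series is an assumption, and the helper names are hypothetical. The BIC formula itself is the standard definition, k ln(n) − 2 ln(L).

```python
import math
from typing import List, Sequence, Tuple


def split_train_validation(
    series: Sequence[float], train_parts: int = 9, total_parts: int = 10
) -> Tuple[List[float], List[float]]:
    """Split a series 9:1 by default; chronological order is an assumption."""
    cut = (len(series) * train_parts) // total_parts
    return list(series[:cut]), list(series[cut:])


def bic(log_likelihood: float, n_params: int, n_points: int) -> float:
    """Standard BIC: k * ln(n) - 2 * ln(L); lower values are better."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood


train, validation = split_train_validation(list(range(100)))
print(len(train), len(validation))
print(bic(log_likelihood=-50.0, n_params=3, n_points=len(train)))
```

Because BIC penalizes each extra parameter by ln(n), a more complex kernel has to improve the log-likelihood by more than that penalty to be preferred.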