Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimization

Authors: Colin Doumont, Victor Picheny, Viacheslav (Slava) Borovitskiy, Henry Moss

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we aim to provide a clearer picture of combinatorial kernels and unify them into a common framework based on heat kernels. First, we present the necessary background information on combinatorial BO and the associated combinatorial kernels (Section 2). To relate these kernels, we then present our unifying framework, based on heat kernels, as well as our generalizations and extensions thereof (Sections 3). Finally, we analyze and validate our theoretical framework using empirical experiments (Section 4), and conclude by discussing our contributions (Section 5).
Researcher Affiliation	Collaboration	Colin Doumont (ETH Zürich, University of Cambridge, Tübingen AI Center); Victor Picheny (Secondmind); Viacheslav Borovitskiy (ETH Zürich, University of Edinburgh); Henry Moss (University of Cambridge, Lancaster University)
Pseudocode	No	The paper describes methods and derivations using mathematical equations and textual descriptions, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	Further details regarding our experimental set-up are described in Appendix E.2, and our implementation can be found at: https://github.com/colmont/heat-kernels-4-BO.git.
Open Datasets	Yes	Our analysis is performed on a wide range of challenging combinatorial optimization problems: Pest Control, LABS and Cluster Expansion are displayed below in Figures 1 and 2, and Contamination Control and Max SAT have been moved to Appendix E.3 (due to space constraints). All five problems are taken from Bounce (Papenmeier et al., 2023)... Additionally, in Appendices E.4 and E.5, we experiment on five (discretized) permutation-invariant Simon Fraser University (SFU) test functions (Surjanovic and Bingham, 2013), as well as two biological and two logic-synthesis tasks from the MCBO benchmark.
Dataset Splits	Yes	For the regression problems, we use different training sizes, ranging from 25 to 200 datapoints, and a test size of 200 datapoints. Here, all datapoints are sampled uniformly at random.
Hardware Specification	Yes	We were able to run all evaluated baselines on single-core CPUs with 1GB of RAM. The only notable exception is SSK (or BOSS), which is prohibitively slow on CPUs and was run on NVIDIA RTX6000 GPUs with 24GB of RAM.
Software Dependencies	No	As hyperparameter optimizer and acquisition function, we use the ubiquitous Adam optimizer (Kingma and Ba, 2015) and Expected Improvement (EI) (Močkus, 1975; Jones et al., 1998), respectively. The paper mentions using specific optimizers and acquisition functions but does not provide specific version numbers for software libraries or packages used in its implementation, such as Python or PyTorch versions.
Experiment Setup	Yes	Experimental set-up: We use 20 initialization points, opt for a batch size of 1 and allow up to 200 iterations. As hyperparameter optimizer and acquisition function, we use the ubiquitous Adam optimizer (Kingma and Ba, 2015) and Expected Improvement (EI) (Močkus, 1975; Jones et al., 1998), respectively. To optimize the acquisition function in the pipelines of Figures 1, 4, 6 and 8, we use a genetic algorithm and include a trust region... No priors were imposed on the hyperparameters, and therefore we use MLE and not MAP inference. We standardize all black-box function values... The experiments are repeated across 20 random seeds, except for Contamination Control and LABS, which are noisier and therefore require double the amount of runs (i.e. 40 random seeds).
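
The Experiment Setup row names Expected Improvement and standardized black-box values without spelling either out. The following is a minimal sketch of both, using only the Python standard library. It is not taken from the authors' repository: the `SETUP` dictionary keys and both function names are hypothetical, and the EI formula assumes a Gaussian posterior and a maximization convention, neither of which is stated in the row above.

```python
import math
from statistics import mean, pstdev

# Numbers quoted in the Experiment Setup row; the key names are hypothetical.
SETUP = {
    "n_init": 20,            # initialization points
    "batch_size": 1,
    "max_iterations": 200,
    "n_seeds_default": 20,
    "n_seeds_noisy": 40,     # Contamination Control and LABS
}


def standardize(values):
    """Shift and scale observed values to zero mean and unit variance."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0.0:
        # All observations identical: nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a posterior N(mu, sigma^2).

    `best` is the largest (standardized) value observed so far.
    """
    if sigma <= 0.0:
        # Degenerate posterior: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf
```

For a minimization problem, the same function can be applied to negated values, i.e. `expected_improvement(-mu, sigma, -best)` with `best` the smallest observation.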