Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Exploring and Exploiting Model Uncertainty in Bayesian Optimization

Authors: Zishi Zhang, Tao Ren, Yijie Peng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, our method outperforms state-of-the-art approaches in various challenging scenarios, including highly non-stationary and heavy-tailed reward settings where classical GP-based BO often fails. [...] We evaluate our method on ten benchmark tasks, including six synthetic functions, three real-world problems, and one LLM prompt optimization task.
Researcher Affiliation	Academia	1 Guanghua School of Management, Peking University 2 Xiangjiang Laboratory EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Thompson Sampling with -GP Surrogate Model ( -GP-TS)
Open Source Code	Yes	Code, configuration files, and instructions for reproducing experiments will be released publicly upon publication.
Open Datasets	Yes	Portfolio [Cakmak et al., 2020] is a benchmark of tuning the hyperparameters of a trading strategy to maximize returns. [...] HPOBench [Eggensperger et al., 2021] provides standard hyperparameter tuning tasks for ML models [...]. NASBench201 [Dong and Yang, 2020] is a neural network search task on CIFAR100. [...] Prompt optimization for language understanding. We conduct prefix prompt optimization for seven language understanding datasets, including sentiment classification (SST-2 [Socher et al., 2013], SST-5 [Socher et al., 2013], MR [Pang and Lee, 2005], CR [Hu and Liu, 2004]), topic classification (AG's News [Zhang et al., 2015], TREC [Voorhees and Tice, 2000]) and subjectivity classification (Subj [Pang and Lee, 2004]).
Dataset Splits	Yes	AG's News [Zhang et al., 2015] is a topic classification dataset [...] The dataset contains 120,000 training samples and 7,600 test samples, and is commonly used to benchmark text classification models. [...] TREC [Voorhees and Tice, 2000] [...] We use the 6-way classification task with 5,452 training and 500 test examples.
Hardware Specification	No	The paper does not explicitly describe the hardware used for running its experiments within the main text or its appendices. The NeurIPS checklist mentions "Experiments compute resources" but provides no specific hardware details.
Software Dependencies	No	The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup	Yes	To introduce more challenging scenarios, we consider non-stationary and heavy-tailed variants of these functions. In the heavy-tailed (HT) cases, all the functions are corrupted by Weibull-distributed noises. In the non-stationary (NS) setting, the base test functions are modulated by a trigonometric-exponential term of the form $f_{\mathrm{NS}}(x) = (1 + \alpha \sin(x) e^{x}) f(x)$, which introduces non-stationarity across the domain. [...] We tune three hyperparameters of a trading strategy: risk aversion [0.1, 1000], trade aversion [5, 8], and holding cost multiplier [0.1, 100]. The environment includes two random variables: bid-ask spread $U[10^{-4}, 10^{-2}]$ and borrowing cost $U[10^{-4}, 10^{-3}]$. [...] The decision variable is a continuous prefix embedding prepended to each input; its quality is evaluated based on downstream task performance (e.g., accuracy). Due to the high dimensionality of the prompt embedding space, we apply Uniform Manifold Approximation and Projection (UMAP) [McInnes et al., 2018] to project the original space into a lower-dimensional latent space. [...] We adopt a fully Bayesian approach for the hyperparameters $\Theta = \{\beta, \nu, \tau, \sigma^2, \phi\} = \{\Theta^{(1)}, \Theta^{(2)}\}$, where the first-layer parameters are $\Theta^{(1)} = \{\beta, \tau\}$ and the second-layer parameters are $\Theta^{(2)} = \{\nu, \sigma^2, \phi\}$. Specifically, the priors on hyperparameters $\Theta$ are given by $\beta, \tau^2 \sim N_p(\beta_0, \Sigma_\beta) \times \mathrm{IGamma}(a_\tau, b_\tau)$, $\sigma^2 \sim \mathrm{IGamma}(a_\sigma, b_\sigma)$, $\phi \sim U([0, b_\phi]^d)$, $\nu \sim \mathrm{Gamma}(a_\nu, b_\nu)$, where $\beta$ has a Gaussian prior, $\tau^2$ and $\sigma^2$ have inverse Gamma priors, $\phi$ has a uniform prior on $(0, b_\phi]$ and $\nu$ has a Gamma prior. We set $a_\tau = a_\sigma = 2$. [...] We set $\beta_0 = [1, \ldots, 1] \in \mathbb{R}^d$ and set $\Sigma_\beta = I_{d \times d}$.
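
The experiment setup quoted in the last row can be illustrated with a short NumPy sketch. This is not the authors' code, which had not been released when this page was produced. It only shows one plausible reading of the non-stationary modulation, the Weibull noise corruption, and the hyperparameter priors. The base function, the value of α, the Weibull shape and scale, every prior constant other than a_τ = a_σ = 2, the shape-rate reading of the Gamma prior, and the use of a single dimension d for both β and ϕ are all assumptions made for illustration.

```python
import numpy as np


def base_function(x):
    """Hypothetical 1-D stand-in for a synthetic benchmark function."""
    return np.sin(3.0 * x) + 0.5 * x


def non_stationary(f, x, alpha=0.1):
    """Non-stationary variant: f_NS(x) = (1 + alpha * sin(x) * exp(x)) * f(x).

    alpha is an assumed value; the quoted text does not state it.
    """
    return (1.0 + alpha * np.sin(x) * np.exp(x)) * f(x)


def heavy_tailed(f, x, rng, shape=0.5, scale=1.0):
    """Heavy-tailed variant: add Weibull-distributed noise to f(x).

    shape and scale are assumed; a shape below 1 gives a heavy right tail.
    """
    noise = scale * rng.weibull(shape, size=np.shape(x))
    return f(x) + noise


def sample_priors(rng, d, a_tau=2.0, b_tau=1.0, a_sigma=2.0, b_sigma=1.0,
                  b_phi=1.0, a_nu=2.0, b_nu=1.0):
    """Draw one sample of (beta, tau^2, sigma^2, phi, nu) from the stated priors.

    Only a_tau = a_sigma = 2, beta_0 = ones and Sigma_beta = I come from the
    quoted text; the remaining constants are placeholders.
    """
    beta = rng.multivariate_normal(np.ones(d), np.eye(d))
    # If X ~ Gamma(shape=a, rate=b), then 1/X ~ InverseGamma(a, scale=b).
    # NumPy's gamma takes a scale argument, so rate b becomes scale 1/b.
    tau2 = 1.0 / rng.gamma(a_tau, 1.0 / b_tau)
    sigma2 = 1.0 / rng.gamma(a_sigma, 1.0 / b_sigma)
    phi = rng.uniform(0.0, b_phi, size=d)
    nu = rng.gamma(a_nu, 1.0 / b_nu)  # assumes a shape-rate parameterization
    return {"beta": beta, "tau2": tau2, "sigma2": sigma2, "phi": phi, "nu": nu}


rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 5)
y_ns = non_stationary(base_function, x)
y_ht = heavy_tailed(base_function, x, rng)
theta = sample_priors(rng, d=3)
```

With α = 0 the non-stationary variant reduces to the base function, which gives a quick sanity check. Because Weibull noise is non-negative, the corrupted observations are biased upward as well as heavy-tailed. Whether the paper centers or rescales the noise is not stated in the excerpt.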