A Study of Bayesian Neural Network Surrogates for Bayesian Optimization

Authors: Yucen Lily Li, Tim G. J. Rudner, Andrew Gordon Wilson

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions. (A minimal deep-ensemble sketch appears after the table.)
Researcher Affiliation | Academia | Yucen Lily Li, Tim G. J. Rudner, Andrew Gordon Wilson (New York University)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/yucenli/bnn-bo.
Open Datasets | Yes | We evaluate BNN and GP surrogates on a variety of synthetic benchmarks, and we choose problems with a wide span of input dimensions to understand how the performance differs as we increase the dimensionality of the data. We also select problems that vary in the number of objectives to compare the performance of the different surrogate models. Detailed problem descriptions can be found in Appendix B.1, and we include the experiment setup in Appendix C. We use Monte-Carlo based Expected Improvement (Balandat et al., 2020) as our acquisition function for all problems. ... To provide an evaluation of BNN surrogates in more realistic optimization problems, we consider a diverse selection of real-world applications which span a variety of domains, such as solving differential equations and monitoring cellular network coverage (Dreifuerst et al., 2021; Eriksson et al., 2019; Maddox et al., 2021; Oh et al., 2019; Wang et al., 2020). ... Knowledge Distillation: For our experiment, we use the MNIST dataset, and we train a LeNet-5 for our teacher model. (A BoTorch sketch of the Monte-Carlo Expected Improvement acquisition function appears after the table.)
Dataset Splits | Yes | During each iteration of Bayesian optimization, we use a grid search over prior variance (0.1, 1.0, 10.0) and likelihood variance (0.1, 0.32, 1.0) to find the combination which maximizes L, where L represents the likelihood of the surrogate model on a random 20% of the existing function evaluations when the model is trained on the other 80%. (A sketch of this selection loop appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components and libraries like 'BoTorch' and 'SMAC' but does not specify their version numbers.
Experiment Setup | Yes | For all datasets, we normalize input values to be between [0, 1] and standardize output values to have mean 0 and variance 1. We also use Monte-Carlo based Expected Improvement as our acquisition function. GP: For single-objective problems, we use GPs with a Matérn 5/2 kernel, adaptive scale, a length-scale prior of Gamma(3, 6), and an output-scale prior of Gamma(2.0, 0.15). ... I-BNN: We use I-BNNs with 3 hidden layers and the ReLU activation function. We set the variance of the weights to 10.0, and the variance of the bias to 1.6. ... DKL: We set up the base kernel using the same Matérn 5/2 kernel that we use for GPs. For the feature extractor, we use the model parameters as explained above. ... HMC: We use HMC with an adaptive step size, and we choose the architecture as explained above. ... SGHMC: We use SGHMC with minibatch size of 5 and neural network architecture as indicated above. ... LLA: We use the model architecture as explained. ... ENSEMBLE: We use an ensemble of 5 models, each with the architecture explained above. Each model is trained on a random 80% of the function evaluations. (A GPyTorch sketch of the GP prior appears after the table.)
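The Research Type row mentions deep ensembles among the surrogate families the paper compares. As a reference point, here is a minimal sketch of how a deep-ensemble surrogate produces the predictive mean and variance an acquisition function consumes. The hidden-layer widths and the (omitted) training loop are illustrative assumptions; only the 5-member count and the per-member 80% resampling come from the paper.

```python
import torch
import torch.nn as nn

def make_member(d_in: int) -> nn.Module:
    # Illustrative architecture; the paper's exact layer sizes may differ.
    return nn.Sequential(
        nn.Linear(d_in, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )

def ensemble_predict(members, x):
    # Each member gives a point prediction; the spread across members
    # serves as the epistemic uncertainty fed to the acquisition function.
    preds = torch.stack([m(x) for m in members], dim=0)  # (n_members, n, 1)
    return preds.mean(dim=0), preds.var(dim=0)

# Paper's setup: 5 members, each trained on a random 80% of the evaluations.
members = [make_member(d_in=2) for _ in range(5)]
mean, var = ensemble_predict(members, torch.rand(10, 2))
```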
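The Open Datasets row quotes the paper's acquisition function, Monte-Carlo based Expected Improvement (Balandat et al., 2020), which is the qExpectedImprovement implementation in BoTorch. A minimal sketch with a stock SingleTaskGP surrogate follows; the toy objective, unit-cube bounds, and optimizer settings are assumptions for illustration, not the paper's configuration.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.monte_carlo import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# Toy data in the normalized [0, 1]^d input space the paper describes.
train_X = torch.rand(20, 2, dtype=torch.double)
train_Y = -(train_X - 0.5).pow(2).sum(dim=-1, keepdim=True)

model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

# Monte-Carlo Expected Improvement; q=1 recovers sequential BO.
acqf = qExpectedImprovement(model=model, best_f=train_Y.max())
candidate, _ = optimize_acqf(
    acqf,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=10, raw_samples=256,
)
```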
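The Dataset Splits row describes a per-iteration grid search that scores each (prior variance, likelihood variance) pair by the likelihood on a random 20% of the function evaluations after training on the other 80%. A sketch of that selection loop is below; fit_surrogate and heldout_log_likelihood are hypothetical placeholders for the training and scoring routines in the linked repository.

```python
import itertools
import random

PRIOR_VARIANCES = [0.1, 1.0, 10.0]
LIKELIHOOD_VARIANCES = [0.1, 0.32, 1.0]

def select_hyperparameters(X, y, fit_surrogate, heldout_log_likelihood):
    # `fit_surrogate` and `heldout_log_likelihood` are hypothetical helpers:
    # the former trains the surrogate on 80% of the evaluations under the
    # given variances; the latter scores it on the held-out 20%.
    idx = list(range(len(X)))
    random.shuffle(idx)
    k = max(1, len(idx) // 5)          # random 20% held out
    held, train = idx[:k], idx[k:]
    best_pair, best_ll = None, float("-inf")
    for prior_var, lik_var in itertools.product(PRIOR_VARIANCES, LIKELIHOOD_VARIANCES):
        model = fit_surrogate([X[i] for i in train], [y[i] for i in train],
                              prior_var=prior_var, lik_var=lik_var)
        ll = heldout_log_likelihood(model, [X[i] for i in held], [y[i] for i in held])
        if ll > best_ll:
            best_pair, best_ll = (prior_var, lik_var), ll
    return best_pair
```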
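The Experiment Setup row specifies the single-objective GP surrogate: a Matérn 5/2 kernel with a Gamma(3, 6) length-scale prior and a Gamma(2.0, 0.15) output-scale prior, on inputs normalized to [0, 1] and standardized outputs. A GPyTorch sketch of that prior is below; the constant mean and the toy data are assumptions for illustration.

```python
import torch
import gpytorch
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.priors import GammaPrior

class MaternGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Matérn 5/2 with the length-scale / output-scale priors
        # quoted in the Experiment Setup row.
        self.covar_module = ScaleKernel(
            MaternKernel(nu=2.5, ard_num_dims=train_x.shape[-1],
                         lengthscale_prior=GammaPrior(3.0, 6.0)),
            outputscale_prior=GammaPrior(2.0, 0.15),
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.rand(20, 3)   # inputs normalized to [0, 1]
train_y = torch.randn(20)     # outputs standardized to mean 0, variance 1
model = MaternGP(train_x, train_y, gpytorch.likelihoods.GaussianLikelihood())
```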