Robust Gaussian Processes via Relevance Pursuit

Authors: Sebastian Ament, Elizabeth Santorella, David Eriksson, Ben Letham, Maximilian Balandat, Eytan Bakshy

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the empirical performance of RRP against various baselines on a number of regression and Bayesian optimization problems. Specifically, we compare against a standard GP with a Matérn-5/2 kernel (Standard GP), data pre-processing through Ax's adaptive winsorization procedure (Adapt. Wins.), and a power transformation (Power Transf.) [15]. Further, we also consider a Student-t likelihood model from Jylänki et al. [32] (Student-t), the trimmed marginal likelihood model from Andrade and Takeda [10] (Trimmed MLL), and the RCGP model from Altamirano et al. [2]. Unless stated otherwise, all models are implemented in GPyTorch [25] and all experiments in this section use 32 replications. See Appendix D for additional details. (A minimal GPyTorch sketch of the Standard GP baseline appears after the table.)
Researcher Affiliation | Industry | Sebastian Ament (Meta, ament@meta.com); Elizabeth Santorella (Meta, santorella@meta.com); David Eriksson (Meta, deriksson@meta.com); Ben Letham (Meta, bletham@meta.com); Maximilian Balandat (Meta, balandat@meta.com); Eytan Bakshy (Meta, ebakshy@meta.com)
Pseudocode | Yes | Algorithm 1: Relevance Pursuit (Forward Algorithm). Require: X, y, schedule K = (k_1, k_2, ..., k_K), use Bayesian Model Selection (boolean). (A simplified, runnable sketch of the forward pass appears after the table.)
Open Source Code | Yes | Our method can be readily applied to both robust regression problems as well as applications such as Bayesian optimization and is available through BoTorch [13]. Importantly, it also provides theoretical approximation guarantees.
Open Datasets | Yes | We report extended empirical comparisons with Altamirano et al. [2]'s RCGP method, using their experimental setup and method implementation in GPFlow. Including GPFlow in our own benchmarking setup and compute resources proved difficult. To circumvent this, we wrote wrappers for both BoTorch's standard GP and RRP, which also accounts for any orthogonal implementation differences between the two frameworks, and ran the benchmarks locally on an M-series MacBook. See Tables 2 and 3 for the mean absolute error and negative log predictive density, respectively. The tables include the empirical mean and standard deviation over 20 replications on corrupted versions of the following base data sets: 1) Synthetic, which is generated as a draw of a GP with an exponentiated quadratic kernel, and four data sets available on the UCI machine learning repository [34], in particular, 2) Boston [29], 3) Concrete [68], 4) Energy [63], and 5) Yacht [27]. (A sketch of how such a corrupted synthetic data set can be generated appears after the table.)
Dataset Splits | No | The paper mentions 'training data', 'test data', and 'test-set log likelihood', but does not explicitly describe validation splits or their proportions.
Hardware Specification | Yes | We estimate the amount of compute spent on these experiments to be around 2 CPU years in total, using standard (Intel Xeon) CPU hardware. The amount of compute spent on exploratory investigations as part of this work was negligible (this was ad-hoc exploratory and development work on a single CPU machine).
Software Dependencies | No | The paper mentions software such as GPyTorch [25] and GPFlow but does not specify their version numbers, which are required for reproducibility. (A snippet for recording installed package versions appears after the table.)
Experiment Setup | Yes | We use Ament et al. [7]'s qLogNoisyExpectedImprovement (qLogNEI), a variant of the LogEI family of acquisition functions, 32 replications, and initialize all methods with the same quasi-random Sobol batch for each replication. We follow Hvarfner et al. [31] and plot the true value of the best in-sample point according to the GP model posterior at each iteration. We also include Sobol and an Oracle, which is a Standard GP that always observes the uncorrupted value, and consider the backward canonical version of relevance pursuit, denoted by RRP, for these experiments. The plots show the mean performance with a bootstrapped 90% confidence interval. (A minimal BoTorch sketch of this BO loop appears after the table.)
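
The Standard GP baseline referenced under Research Type is a plain exact GP with a Matérn-5/2 kernel. Below is a minimal GPyTorch sketch of such a baseline; the class name, toy data, and training budget are illustrative assumptions, not the paper's implementation.

```python
import torch
import gpytorch


class StandardGPBaseline(gpytorch.models.ExactGP):
    """Plain exact GP with a Matern-5/2 kernel (the 'Standard GP' baseline)."""

    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5)
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


# Illustrative toy data; the paper's benchmarks use corrupted regression data sets.
train_x = torch.linspace(0, 1, 50).unsqueeze(-1)
train_y = torch.sin(6 * train_x).squeeze(-1) + 0.1 * torch.randn(50)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = StandardGPBaseline(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):  # maximize the marginal log likelihood
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```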
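
Algorithm 1 is only excerpted above via its Require line. The following is a simplified, self-contained sketch of a greedy forward pass in the spirit of relevance pursuit: the support S of points allowed an inflated noise variance is grown according to the schedule, adding at each step the point whose inflation most increases the exact marginal log likelihood. It fixes the kernel hyperparameters and uses a single large noise value rather than learning per-point robust variances, and it only hints at the Bayesian model selection step, so it should be read as an illustration of the control flow rather than the paper's or BoTorch's implementation.

```python
import math
import torch


def gp_mll(X, y, lengthscale, signal_var, noise_diag):
    """Exact GP marginal log likelihood with an RBF kernel and per-point noise."""
    K = signal_var * torch.exp(-0.5 * torch.cdist(X, X).pow(2) / lengthscale**2)
    K = K + torch.diag(noise_diag)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    return -0.5 * y @ alpha - L.diagonal().log().sum() - 0.5 * len(y) * math.log(2 * math.pi)


def relevance_pursuit_forward(X, y, schedule, base_noise=1e-2, outlier_noise=25.0):
    """Greedy forward pass: grow the support S of points with inflated noise."""
    S, visited = [], []
    for k in schedule:
        while len(S) < k:
            best_i, best_mll = None, float("-inf")
            for i in range(len(y)):
                if i in S:
                    continue
                noise = torch.full_like(y, base_noise)
                noise[S + [i]] = outlier_noise
                mll = gp_mll(X, y, lengthscale=0.2, signal_var=1.0, noise_diag=noise)
                if mll > best_mll:
                    best_i, best_mll = i, mll
            S.append(best_i)
        # record the model visited at this support size
        noise = torch.full_like(y, base_noise)
        if S:
            noise[S] = outlier_noise
        visited.append((list(S), gp_mll(X, y, 0.2, 1.0, noise)))
    # With Bayesian model selection, one would instead compare (approximate)
    # model evidences across the visited supports; here we return the best MLL.
    return max(visited, key=lambda sm: sm[1])
```

For example, relevance_pursuit_forward(X, y, schedule=(1, 2, 4, 8)) walks through supports of up to 8 suspected outliers and returns the visited support with the highest marginal log likelihood.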
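
The Synthetic base data set under Open Datasets is described as a draw from a GP with an exponentiated quadratic kernel, subsequently corrupted. A minimal sketch of how such data could be generated is below; the lengthscale, corruption fraction, and outlier magnitude are illustrative assumptions, not the paper's exact settings.

```python
import torch


def make_corrupted_synthetic(n=100, d=1, corruption_frac=0.2, noise_std=0.05, seed=0):
    """Draw a function from a GP with an exponentiated quadratic (RBF) kernel
    and corrupt a random subset of the observations."""
    g = torch.Generator().manual_seed(seed)
    X = torch.rand(n, d, generator=g)
    K = torch.exp(-0.5 * torch.cdist(X, X).pow(2) / 0.1**2) + 1e-6 * torch.eye(n)
    f = torch.linalg.cholesky(K) @ torch.randn(n, generator=g)
    y = f + noise_std * torch.randn(n, generator=g)
    # perturb a fraction of the labels with large uniform shifts (the outliers)
    idx = torch.randperm(n, generator=g)[: int(corruption_frac * n)]
    y[idx] = y[idx] + 3.0 * (2.0 * torch.rand(len(idx), generator=g) - 1.0)
    return X, y, idx
```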
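
Since the missing version numbers are the main gap noted under Software Dependencies, one way a reproduction could pin its own environment is to record the installed package versions, e.g.:

```python
from importlib.metadata import PackageNotFoundError, version

# gpflow is only needed for the RCGP comparison; the others cover the RRP setup.
for pkg in ("torch", "gpytorch", "botorch", "gpflow"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```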
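
The Bayesian optimization setup under Experiment Setup uses qLogNEI with a shared quasi-random Sobol initialization. A minimal BoTorch sketch of such a loop is below, using a standard SingleTaskGP in place of the paper's robust RRP model; the toy objective, initialization size, iteration budget, and optimizer settings are illustrative assumptions.

```python
import torch
from botorch.acquisition.logei import qLogNoisyExpectedImprovement
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.optim import optimize_acqf
from botorch.utils.sampling import draw_sobol_samples
from gpytorch.mlls import ExactMarginalLogLikelihood


def objective(X):
    """Hypothetical stand-in for a (possibly corrupted) benchmark objective."""
    return -(X - 0.7).pow(2).sum(dim=-1, keepdim=True)


bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]])

# quasi-random Sobol initialization, shared across methods in the paper's setup
train_X = draw_sobol_samples(bounds=bounds, n=8, q=1, seed=0).squeeze(1)
train_Y = objective(train_X)

for _ in range(20):  # BO iterations (illustrative budget)
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    acqf = qLogNoisyExpectedImprovement(model=model, X_baseline=train_X)
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=8, raw_samples=128
    )
    train_X = torch.cat([train_X, candidate])
    train_Y = torch.cat([train_Y, objective(candidate)])
```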