Robust Gaussian Processes via Relevance Pursuit

Authors: Sebastian Ament, Elizabeth Santorella, David Eriksson, Ben Letham, Maximilian Balandat, Eytan Bakshy

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the empirical performance of RRP against various baselines on a number of regression and Bayesian optimization problems. Specifically, we compare against a standard GP with a Matérn-5/2 kernel (Standard GP), data pre-processing through Ax's adaptive winsorization procedure (Adapt. Wins.), and a power transformation (Power Transf.) [15]. Further, we also consider a Student-t likelihood model from Jylänki et al. [32] (Student-t), the trimmed marginal likelihood model from Andrade and Takeda [10] (Trimmed MLL), and the RCGP model from Altamirano et al. [2]. Unless stated otherwise, all models are implemented in GPyTorch [25] and all experiments in this section use 32 replications. See Appendix D for additional details. (A minimal GPyTorch sketch of the Standard GP baseline appears after the table.)
Researcher Affiliation | Industry | Sebastian Ament (Meta, ament@meta.com); Elizabeth Santorella (Meta, santorella@meta.com); David Eriksson (Meta, deriksson@meta.com); Ben Letham (Meta, bletham@meta.com); Maximilian Balandat (Meta, balandat@meta.com); Eytan Bakshy (Meta, ebakshy@meta.com)
Pseudocode | Yes | Algorithm 1: Relevance Pursuit (Forward Algorithm). Require: X, y, schedule K = (k_1, k_2, ..., k_K), use Bayesian Model Selection (boolean). (A simplified, runnable sketch of the forward pass appears after the table.)
Open Source Code | Yes | Our method can be readily applied to both robust regression problems as well as applications such as Bayesian optimization and is available through BoTorch [13]. Importantly, it also provides theoretical approximation guarantees.
Open Datasets | Yes | We report extended empirical comparisons with Altamirano et al. [2]'s RCGP method, using their experimental setup and method implementation in GPFlow. Including GPFlow in our own benchmarking setup and compute resources proved difficult. To circumvent this, we wrote wrappers for both BoTorch's standard GP and RRP, which also accounts for any orthogonal implementation differences between the two frameworks, and ran the benchmarks locally on an M-series MacBook. See Tables 2 and 3 for the mean absolute error and negative log predictive density, respectively. The tables include the empirical mean and standard deviation over 20 replications on corrupted versions of the following base data sets: 1) Synthetic, which is generated as a draw of a GP with an exponentiated quadratic kernel, and four data sets available on the UCI machine learning repository [34], in particular, 2) Boston [29], 3) Concrete [68], 4) Energy [63], and 5) Yacht [27]. (A sketch of how such a corrupted synthetic data set can be generated appears after the table.)
Dataset Splits | No | The paper mentions 'training data', 'test data', and 'test-set log likelihood', but does not explicitly describe validation splits or their proportions.
Hardware Specification | Yes | We estimate the amount of compute spent on these experiments to be around 2 CPU years in total, using standard (Intel Xeon) CPU hardware. The amount of compute spent on exploratory investigations as part of this work was negligible (this was ad-hoc exploratory and development work on a single CPU machine).
Software Dependencies | No | The paper mentions software such as GPyTorch [25] and GPFlow but does not specify their version numbers, which are required for reproducibility. (A snippet for recording installed package versions appears after the table.)
Experiment Setup | Yes | We use Ament et al. [7]'s qLogNoisyExpectedImprovement (qLogNEI), a variant of the LogEI family of acquisition functions, 32 replications, and initialize all methods with the same quasi-random Sobol batch for each replication. We follow Hvarfner et al. [31] and plot the true value of the best in-sample point according to the GP model posterior at each iteration. We also include Sobol and an Oracle, which is a Standard GP that always observes the uncorrupted value, and consider the backward canonical version of relevance pursuit, denoted by RRP, for these experiments. The plots show the mean performance with a bootstrapped 90% confidence interval. (A minimal BoTorch sketch of this BO loop appears after the table.)
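
The Standard GP baseline referenced under Research Type is a plain exact GP with a Matérn-5/2 kernel. Below is a minimal GPyTorch sketch of such a baseline; the class name, toy data, and training budget are illustrative assumptions, not the paper's implementation.

```python
import torch
import gpytorch


class StandardGPBaseline(gpytorch.models.ExactGP):
    """Plain exact GP with a Matern-5/2 kernel (the 'Standard GP' baseline)."""

    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5)
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


# Illustrative toy data; the paper's benchmarks use corrupted regression data sets.
train_x = torch.linspace(0, 1, 50).unsqueeze(-1)
train_y = torch.sin(6 * train_x).squeeze(-1) + 0.1 * torch.randn(50)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = StandardGPBaseline(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):  # maximize the marginal log likelihood
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```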
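
Algorithm 1 is only excerpted above via its Require line. The following is a simplified, self-contained sketch of a greedy forward pass in the spirit of relevance pursuit: the support S of points allowed an inflated noise variance is grown according to the schedule, adding at each step the point whose inflation most increases the exact marginal log likelihood. It fixes the kernel hyperparameters and uses a single large noise value rather than learning per-point robust variances, and it only hints at the Bayesian model selection step, so it should be read as an illustration of the control flow rather than the paper's or BoTorch's implementation.

```python
import math
import torch


def gp_mll(X, y, lengthscale, signal_var, noise_diag):
    """Exact GP marginal log likelihood with an RBF kernel and per-point noise."""
    K = signal_var * torch.exp(-0.5 * torch.cdist(X, X).pow(2) / lengthscale**2)
    K = K + torch.diag(noise_diag)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    return -0.5 * y @ alpha - L.diagonal().log().sum() - 0.5 * len(y) * math.log(2 * math.pi)


def relevance_pursuit_forward(X, y, schedule, base_noise=1e-2, outlier_noise=25.0):
    """Greedy forward pass: grow the support S of points with inflated noise."""
    S, visited = [], []
    for k in schedule:
        while len(S) < k:
            best_i, best_mll = None, float("-inf")
            for i in range(len(y)):
                if i in S:
                    continue
                noise = torch.full_like(y, base_noise)
                noise[S + [i]] = outlier_noise
                mll = gp_mll(X, y, lengthscale=0.2, signal_var=1.0, noise_diag=noise)
                if mll > best_mll:
                    best_i, best_mll = i, mll
            S.append(best_i)
        # record the model visited at this support size
        noise = torch.full_like(y, base_noise)
        if S:
            noise[S] = outlier_noise
        visited.append((list(S), gp_mll(X, y, 0.2, 1.0, noise)))
    # With Bayesian model selection, one would instead compare (approximate)
    # model evidences across the visited supports; here we return the best MLL.
    return max(visited, key=lambda sm: sm[1])
```

For example, relevance_pursuit_forward(X, y, schedule=(1, 2, 4, 8)) walks through supports of up to 8 suspected outliers and returns the visited support with the highest marginal log likelihood.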
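
The Synthetic base data set under Open Datasets is described as a draw from a GP with an exponentiated quadratic kernel, subsequently corrupted. A minimal sketch of how such data could be generated is below; the lengthscale, corruption fraction, and outlier magnitude are illustrative assumptions, not the paper's exact settings.

```python
import torch


def make_corrupted_synthetic(n=100, d=1, corruption_frac=0.2, noise_std=0.05, seed=0):
    """Draw a function from a GP with an exponentiated quadratic (RBF) kernel
    and corrupt a random subset of the observations."""
    g = torch.Generator().manual_seed(seed)
    X = torch.rand(n, d, generator=g)
    K = torch.exp(-0.5 * torch.cdist(X, X).pow(2) / 0.1**2) + 1e-6 * torch.eye(n)
    f = torch.linalg.cholesky(K) @ torch.randn(n, generator=g)
    y = f + noise_std * torch.randn(n, generator=g)
    # perturb a fraction of the labels with large uniform shifts (the outliers)
    idx = torch.randperm(n, generator=g)[: int(corruption_frac * n)]
    y[idx] = y[idx] + 3.0 * (2.0 * torch.rand(len(idx), generator=g) - 1.0)
    return X, y, idx
```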
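
Since the missing version numbers are the main gap noted under Software Dependencies, one way a reproduction could pin its own environment is to record the installed package versions, e.g.:

```python
from importlib.metadata import PackageNotFoundError, version

# gpflow is only needed for the RCGP comparison; the others cover the RRP setup.
for pkg in ("torch", "gpytorch", "botorch", "gpflow"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```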
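
The Bayesian optimization setup under Experiment Setup uses qLogNEI with a shared quasi-random Sobol initialization. A minimal BoTorch sketch of such a loop is below, using a standard SingleTaskGP in place of the paper's robust RRP model; the toy objective, initialization size, iteration budget, and optimizer settings are illustrative assumptions.

```python
import torch
from botorch.acquisition.logei import qLogNoisyExpectedImprovement
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.optim import optimize_acqf
from botorch.utils.sampling import draw_sobol_samples
from gpytorch.mlls import ExactMarginalLogLikelihood


def objective(X):
    """Hypothetical stand-in for a (possibly corrupted) benchmark objective."""
    return -(X - 0.7).pow(2).sum(dim=-1, keepdim=True)


bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]])

# quasi-random Sobol initialization, shared across methods in the paper's setup
train_X = draw_sobol_samples(bounds=bounds, n=8, q=1, seed=0).squeeze(1)
train_Y = objective(train_X)

for _ in range(20):  # BO iterations (illustrative budget)
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    acqf = qLogNoisyExpectedImprovement(model=model, X_baseline=train_X)
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=8, raw_samples=128
    )
    train_X = torch.cat([train_X, candidate])
    train_Y = torch.cat([train_Y, objective(candidate)])
```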