Bias-Free Scalable Gaussian Processes via Randomized Truncations

Authors: Andres Potapczynski, Luhuan Wu, Dan Biderman, Geoff Pleiss, John P Cunningham

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, through extensive empirical evaluation, we find our methods and their biased counterparts indeed constitute a bias-variance tradeoff. Both RR-CG and SS-RFF are unbiased, recovering nearly the same optimum as the exact GP method, while GPs trained with CG and RFF often converge to solutions with worse likelihood. We note that bias elimination is not always practical. For SS-RFF, optimization is slow, due to the large auxiliary variance needed to counteract the slowly decaying bias of RFF. On the other hand, RR-CG incurs a minimal variance penalty, likely due to the favorable convergence properties of CG. Across a wide range of benchmark datasets, RR-CG demonstrates similar or better predictive performance than CG for the same expected computational time. We report prediction accuracy (RMSE) and negative log likelihood (NLL) in Fig. 6 (see appendix for full tables on predictive performance and training time). (A minimal sketch of the randomized-truncation estimator behind these unbiased methods appears after this table.)
Researcher Affiliation | Academia | Zuckerman Institute, Columbia University; Statistics Department, Columbia University.
Pseudocode | No | The paper describes algorithms and methods using mathematical formulations and textual descriptions but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Our code is available at https://github.com/cunningham-lab/RTGPS.
Open Datasets | Yes | All experiments are implemented in GPyTorch (Gardner et al., 2018). We compare the predictive performance of GPs that use RR-CG, CG, and Cholesky for hyperparameter optimization. We use CG with J = 100 iterations, and RR-CG with E[J] = 100 expected iterations; both methods use the preconditioner of Gardner et al. (2018). All RFF models use 700 random features. For SVGP, we use 1,024 inducing points and minibatches of size 1,024 as in (Wang et al., 2019). The POE models are comprised of GP experts that are each trained on 1,024-data-point subsets. For sgGP, the subsampled datasets are constructed by selecting a random point (x, y) and its 15 nearest neighbors as in (Chen et al., 2020). We use a subset of the PoleTele UCI dataset. We train a GP regression model on a toy dataset: y = x sin(5πx) + ε, with ε ~ N(0, 0.01). We compare models trained on the PoleTele dataset (Fig. 4) and the Bike dataset (Fig. 5). We compare RR-CG, CG, and Cholesky on a wide range of UCI datasets (Asuncion & Newman, 2007). We report prediction accuracy (RMSE) and negative log likelihood (NLL) in Fig. 6.
Dataset Splits | Yes | Each dataset is randomly split into 64% training, 16% validation, and 20% testing sets.
Hardware Specification | No | The paper mentions 'GPU-accelerated matrix products' as a benefit of CG and states 'requiring multiple GPUs for training and testing' for very large datasets, but it does not specify any particular GPU model, CPU, or other hardware used for their experiments.
Software Dependencies | No | The paper states, 'All experiments are implemented in GPyTorch (Gardner et al., 2018)'. While it names a software library, it does not provide specific version numbers for GPyTorch itself, or for other key components like Python or CUDA.
Experiment Setup | Yes | We use CG with J = 100 iterations, and RR-CG with E[J] = 100 expected iterations; both methods use the preconditioner of Gardner et al. (2018). All RFF models use 700 random features. For SVGP, we use 1,024 inducing points and minibatches of size 1,024 as in (Wang et al., 2019). The POE models are comprised of GP experts that are each trained on 1,024-data-point subsets. For sgGP, the subsampled datasets are constructed by selecting a random point (x, y) and its 15 nearest neighbors as in (Chen et al., 2020). (Hedged configuration sketches for the CG-budgeted exact GP and for the SVGP baseline follow this table.)
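
The bias-variance tradeoff summarized in the Research Type row rests on Russian-roulette-style reweighting: a randomly truncated series can be estimated without bias by dividing each retained term by its survival probability. The sketch below is a minimal, generic illustration of that estimator on a toy geometric series; it is not the authors' RR-CG or SS-RFF implementation (that lives in the RTGPS repository), and the stopping probability `p_stop` is an arbitrary choice for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: the geometric series sum_{j >= 1} 0.5**j = 1.0.
def increment(j):
    return 0.5 ** j

def rr_estimate(p_stop=0.3):
    """One Russian-roulette sample of the full (infinite) series.

    A random truncation level J is drawn by flipping a coin after each
    term; term j is reweighted by 1 / P(J >= j), which keeps the sample
    unbiased even though only finitely many terms are ever summed.
    """
    total, j, survival = 0.0, 1, 1.0   # survival = P(J >= j)
    while True:
        total += increment(j) / survival
        if rng.random() < p_stop:       # stop after term j with probability p_stop
            return total
        survival *= 1.0 - p_stop        # P(J >= j + 1)
        j += 1

samples = np.array([rr_estimate() for _ in range(100_000)])
print("mean:", samples.mean())   # close to 1.0: unbiased, unlike a fixed short truncation
print("var: ", samples.var())    # the variance paid for removing the truncation bias
```

A fixed truncation at J = 3, by contrast, returns 0.875 every time: zero variance but a persistent bias, which is exactly the tradeoff the paper quantifies for CG/RFF versus RR-CG/SS-RFF.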
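
The toy regression experiment and the J = 100 CG budget quoted above can be approximated with stock GPyTorch. The following is an assumed reconstruction rather than the authors' script: the sample size (1,000 points), the RBF kernel, the optimizer settings, and the use of `max_cholesky_size(0)` to force the CG code path are illustrative choices, and the randomized-truncation variant (RR-CG) itself is provided in the RTGPS repository rather than in GPyTorch.

```python
import math
import torch
import gpytorch

# Toy data from the paper's illustration: y = x sin(5*pi*x) + eps, eps ~ N(0, 0.01).
# The sample size of 1,000 is an assumption of this sketch.
train_x = torch.rand(1000)
train_y = train_x * torch.sin(5 * math.pi * train_x) + 0.1 * torch.randn(1000)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train()
likelihood.train()
# max_cholesky_size(0) forces the iterative (CG-based) solver; max_cg_iterations(100)
# mirrors the J = 100 iteration budget reported for the CG baseline.
with gpytorch.settings.max_cholesky_size(0), gpytorch.settings.max_cg_iterations(100):
    for _ in range(100):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
```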
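
For the SVGP baseline (1,024 inducing points, minibatches of size 1,024), a standard GPyTorch variational setup along the lines below is a plausible reading of the reported configuration; the kernel, learning rate, number of epochs, and the random initialization of the inducing locations are assumptions of this sketch, not details confirmed by the paper.

```python
import torch
import gpytorch
from torch.utils.data import DataLoader, TensorDataset

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

def train_svgp(train_x, train_y, epochs=50):
    # 1,024 inducing points and minibatches of 1,024, as reported for the SVGP baseline.
    inducing_points = train_x[torch.randperm(train_x.size(0))[:1024]]
    model = SVGPModel(inducing_points)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))
    optimizer = torch.optim.Adam(
        list(model.parameters()) + list(likelihood.parameters()), lr=0.01
    )
    loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1024, shuffle=True)

    model.train()
    likelihood.train()
    for _ in range(epochs):
        for x_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = -mll(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
    return model, likelihood
```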