Bias-Free Scalable Gaussian Processes via Randomized Truncations
Authors: Andres Potapczynski, Luhuan Wu, Dan Biderman, Geoff Pleiss, John P Cunningham
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, through extensive empirical evaluation, we find our methods and their biased counterparts indeed constitute a bias-variance tradeoff. Both RR-CG and SS-RFF are unbiased, recovering nearly the same optimum as the exact GP method, while GPs trained with CG and RFF often converge to solutions with worse likelihood. We note that bias elimination is not always practical. For SS-RFF, the optimization is slow, due to the large auxiliary variance needed to counteract the slowly decaying bias of RFF. On the other hand, RR-CG incurs a minimal variance penalty, likely due to the favorable convergence properties of CG. Across a wide range of benchmark datasets, RR-CG demonstrates similar or better predictive performance compared to CG using the same expected computational time. We report prediction accuracy (RMSE) and negative log likelihood (NLL) in Fig. 6 (see appendix for full tables on predictive performance and training time). (A minimal sketch of the randomized-truncation idea appears after the table.) |
| Researcher Affiliation | Academia | ¹Zuckerman Institute, Columbia University; ²Statistics Department, Columbia University. |
| Pseudocode | No | The paper describes algorithms and methods using mathematical formulations and textual descriptions but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Our code is available at https://github.com/cunningham-lab/RTGPS. |
| Open Datasets | Yes | All experiments are implemented in GPyTorch (Gardner et al., 2018). We compare the predictive performance of GPs that use RR-CG, CG, and Cholesky for hyperparameter optimization. We use CG with J = 100 iterations, and RR-CG with E[J] = 100 expected iterations; both methods use the preconditioner of Gardner et al. (2018). All RFF models use 700 random features. For SVGP, we use 1,024 inducing points and minibatches of size 1,024 as in (Wang et al., 2019). The POE models comprise GP experts that are each trained on 1,024-data-point subsets. For sgGP, the subsampled datasets are constructed by selecting a random point (x, y) and its 15 nearest neighbors as in (Chen et al., 2020). We use a subset of the PoleTele UCI dataset. We train a GP regression model on a toy dataset: y = x sin(5πx) + ε with ε ∼ N(0, 0.01). We compare models trained on the PoleTele dataset (Fig. 4) and the Bike dataset (Fig. 5). We compare RR-CG, CG, and Cholesky on a wide range of UCI datasets (Asuncion & Newman, 2007). We report prediction accuracy (RMSE) and negative log likelihood (NLL) in Fig. 6. (A sketch that generates this toy dataset and the reported splits appears after the table.) |
| Dataset Splits | Yes | Each dataset is randomly split into 64% training, 16% validation, and 20% testing sets. |
| Hardware Specification | No | The paper mentions 'GPU-accelerated matrix products' as a benefit of CG and states 'requiring multiple GPUs for training and testing' for very large datasets, but it does not specify any particular GPU model, CPU, or other hardware used for their experiments. |
| Software Dependencies | No | The paper states, 'All experiments are implemented in GPyTorch (Gardner et al., 2018)'. While it names a software library, it does not provide specific version numbers for GPyTorch itself, or for other key components such as Python or CUDA. |
| Experiment Setup | Yes | We use CG with J = 100 iterations, and RR-CG with E[J] = 100 expected iterations; both methods use the preconditioner of Gardner et al. (2018). All RFF models use 700 random features. For SVGP, we use 1,024 inducing points and minibatches of size 1,024 as in (Wang et al., 2019). The POE models comprise GP experts that are each trained on 1,024-data-point subsets. For sgGP, the subsampled datasets are constructed by selecting a random point (x, y) and its 15 nearest neighbors as in (Chen et al., 2020). (A GPyTorch configuration sketch matching these settings appears after the table.) |
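
The unbiasedness claims in the Research Type row rest on randomized (Russian Roulette) truncation: truncate a convergent series at a random index and reweight every kept term by its survival probability, so the estimator's expectation equals the full sum. The paper applies this idea to CG partial solves (RR-CG) and to random-feature expansions (SS-RFF); the generic helper below, `russian_roulette_sum`, is a minimal hypothetical sketch of the estimator itself, not the authors' implementation.

```python
import numpy as np

def russian_roulette_sum(term, p=0.9, rng=None):
    """Unbiased single-sample estimate of sum_{j>=0} term(j).

    The truncation index J is geometric: after computing term j, we stop
    with probability 1 - p. Each kept term is divided by its survival
    probability P(J >= j) = p**j, so E[estimate] equals the full sum
    whenever the series converges absolutely.
    """
    rng = rng if rng is not None else np.random.default_rng()
    estimate, survival, j = 0.0, 1.0, 0
    while True:
        estimate += term(j) / survival
        if rng.random() > p:       # stop after term j with probability 1 - p
            return estimate
        survival *= p              # P(J >= j+1) = p ** (j+1)
        j += 1

# Example: estimate sum_{j>=0} 0.5**j = 2 by averaging many samples.
rng = np.random.default_rng(0)
samples = [russian_roulette_sum(lambda j: 0.5 ** j, rng=rng) for _ in range(10_000)]
print(np.mean(samples))  # close to 2.0
```

The bias-variance tradeoff the paper reports is visible here: slowly decaying terms (as with RFF) need a large continuation probability to keep the reweighted terms bounded, inflating variance, whereas fast-converging terms (as with CG) keep the variance penalty small.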
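
For the toy regression task and the reported 64/16/20 split, a minimal sketch follows. The input range [0, 1] and the sample size are assumptions made for illustration; the noise variance 0.01 (standard deviation 0.1) comes from ε ∼ N(0, 0.01) in the quoted text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target from the paper: y = x sin(5*pi*x) + eps,
# eps ~ N(0, 0.01). Input range and sample size are assumptions.
n = 1_000
x = rng.uniform(0.0, 1.0, size=n)
y = x * np.sin(5 * np.pi * x) + rng.normal(0.0, np.sqrt(0.01), size=n)

# Random 64% / 16% / 20% train/validation/test split, as reported.
perm = rng.permutation(n)
n_train, n_val = int(0.64 * n), int(0.16 * n)
train_idx, val_idx, test_idx = np.split(perm, [n_train, n_train + n_val])
```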
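
The Experiment Setup row maps onto standard GPyTorch knobs. In the sketch below, the RBF kernel, training-loop length, and learning rate are assumptions not stated in the excerpt; `gpytorch.settings.max_cg_iterations(100)` matches the reported J = 100 CG cap, and `max_cholesky_size(0)` forces the iterative path. RR-CG itself (E[J] = 100 expected iterations) lives in the authors' RTGPS repository, not stock GPyTorch.

```python
import math
import torch
import gpytorch

# Dummy training data standing in for a UCI regression task.
train_x = torch.linspace(0, 1, 512)
train_y = train_x * torch.sin(5 * math.pi * train_x) + 0.1 * torch.randn(512)

class ExactGPModel(gpytorch.models.ExactGP):
    """Plain exact-GP regression model; the RBF kernel is an assumption."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train()
likelihood.train()
# Force the CG-based path and cap it at J = 100 iterations, matching the
# reported biased-CG baseline.
with gpytorch.settings.max_cholesky_size(0), gpytorch.settings.max_cg_iterations(100):
    for _ in range(50):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
```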