Stochastic Gradient Descent for Gaussian Processes Done Right

Authors: Jihao Andreas Lin, Shreyas Padhy, Javier Antorán, Austin Tripp, Alexander Terenin, Csaba Szepesvári, José Miguel Hernández-Lobato, David Janz

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present three experiments that confirm the strength of our SDD algorithm on standard benchmarks. The first two experiments, on UCI regression and large-scale Bayesian optimisation, replicate those of Lin et al. (2023), and compare against SGD (Lin et al., 2023), CG (Wang et al., 2019) and SVGP (Hensman et al., 2013). Unless indicated otherwise, we use the code, setup and hyperparameters of Lin et al. (2023). Our third experiment tests stochastic dual descent on five molecule-protein binding affinity prediction benchmarks of García-Ortegón et al. (2022).
Researcher Affiliation | Academia | ¹University of Cambridge, ²Max Planck Institute for Intelligent Systems, ³Cornell University, ⁴University of Alberta
Pseudocode | Yes | Algorithm 1: Stochastic dual descent for approximating α*(b) = (K + λI)^{-1} b
Require: kernel matrix K with rows K_1, ..., K_n ∈ ℝ^n, targets b ∈ ℝ^n, likelihood variance λ > 0, number of steps T ∈ ℕ_+, batch size B ∈ {1, ..., n}, step size β > 0, momentum parameter ρ ∈ [0, 1), averaging parameter r ∈ (0, 1]
1: v_0 = 0; α_0 = 0; ᾱ_0 = 0 (all in ℝ^n)
2: for t ∈ {1, ..., T} do
3:   sample I_t = (i_t^1, ..., i_t^B) ~ Uniform({1, ..., n}) independently ▷ random coordinates
4:   g_t = (n/B) Σ_{i ∈ I_t} ((K_i + λ e_i)^T (α_{t-1} + ρ v_{t-1}) - b_i) e_i ▷ gradient estimate
5:   v_t = ρ v_{t-1} - β g_t ▷ velocity update
6:   α_t = α_{t-1} + v_t ▷ parameter update
7:   ᾱ_t = r α_t + (1 - r) ᾱ_{t-1} ▷ iterate averaging
8: return ᾱ_T
[A minimal JAX sketch of this algorithm follows the table.]
Open Source Code | Yes | Code available at: https://github.com/cambridge-mlg/sgd-gp.
Open Datasets | Yes | We benchmark on the 9 UCI repository regression data sets (Dua and Graff, 2017) used by Wang et al. (2019) and Lin et al. (2023). We use the DOCKSTRING regression benchmark of García-Ortegón et al. (2022).
Dataset Splits | Yes | We report mean values and standard error across five 90%-train, 10%-test splits for all data sets, except the largest, where three splits are used. For each protein, we use a standard train-test split of 210k and 40k molecules, respectively. [A sketch of this split protocol follows the table.]
Hardware Specification | Yes | Table 1: Root-mean-square error (RMSE), compute time (on an A100 GPU), and negative log-likelihood (NLL), for 9 UCI regression tasks for all methods considered.
Software Dependencies | Yes | We use the implementation and hyperparameter configurations from Lin et al. (2023) for the SGD, SVGP and CG baselines, which uses the jax library (Bradbury et al., 2018). Some more detail on the implementations is as follows. CG: We implement CG using the jax.scipy library, with a pivoted Cholesky preconditioner of size 100 computed using the TensorFlow Probability library. SVGP: We initialise inducing points using k-means and k-means++. We then optimise all variational parameters, including inducing point locations, by maximising the ELBO with the Adam optimiser until convergence, using the GPJax library (Pinder and Dodd, 2022). [A minimal jax.scipy CG sketch follows the table.]
Experiment Setup | Yes | Unless indicated otherwise, we use the code, setup and hyperparameters of Lin et al. (2023). Our third experiment tests stochastic dual descent on five molecule-protein binding affinity prediction benchmarks of García-Ortegón et al. (2022). For all methods, we use a Matérn-3/2 kernel with a fixed set of hyperparameters chosen by Lin et al. (2023) via maximum likelihood. SGD uses βn = 0.5 to estimate the mean function weights, and βn = 0.1 to draw samples. For SDD, we use step-sizes βn which are 100× larger, except for ELEVATORS, KEGGDIR and BUZZ, where this causes divergence; there, we use a 10× larger step-size instead. SDD and SGD are both run for 100k steps with batch size B = 512 for both the mean function and posterior samples. For CG, we use a maximum of 1000 steps for data sets with N ≤ 50k, and a tolerance of 0.01. On the four largest data sets, the per-step cost of CG is too large to run 1000 steps, and we run 100 steps instead. For SVGP, the number of inducing points is chosen such that the compute cost approximately matches that of other methods: 3000 for the smaller five data sets and 9000 for the larger four. [A Matérn-3/2 kernel sketch follows the table.]
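
The following is a minimal JAX sketch of Algorithm 1 from the Pseudocode row, written for this page rather than taken from the authors' repository: the name sdd_solve, the dense precomputed kernel matrix interface, and the default hyperparameter values are illustrative assumptions.

```python
# Minimal sketch of Algorithm 1 (stochastic dual descent), assuming a dense
# precomputed kernel matrix K. Names and default values are illustrative.
import jax
import jax.numpy as jnp


def sdd_solve(K, b, lam, num_steps=100_000, batch_size=512,
              step_size=1e-3, momentum=0.9, averaging=1e-3, seed=0):
    """Approximate alpha*(b) = (K + lam * I)^{-1} b by stochastic dual descent."""
    n = b.shape[0]

    def step(carry, key):
        alpha, v, alpha_bar = carry
        # Line 3: sample a batch of random coordinates uniformly (with replacement).
        idx = jax.random.randint(key, (batch_size,), 0, n)
        # Line 4: unbiased estimate of the dual gradient (K + lam*I) alpha - b,
        # evaluated at the momentum lookahead point and rescaled by n / B.
        probe = alpha + momentum * v
        residual = K[idx] @ probe + lam * probe[idx] - b[idx]
        grad = jnp.zeros(n).at[idx].add((n / batch_size) * residual)
        v_new = momentum * v - step_size * grad        # line 5: velocity update
        alpha_new = alpha + v_new                      # line 6: parameter update
        alpha_bar_new = averaging * alpha_new + (1.0 - averaging) * alpha_bar  # line 7
        return (alpha_new, v_new, alpha_bar_new), None

    keys = jax.random.split(jax.random.PRNGKey(seed), num_steps)
    init = (jnp.zeros(n), jnp.zeros(n), jnp.zeros(n))
    (_, _, alpha_bar), _ = jax.lax.scan(step, init, keys)
    return alpha_bar
```

The returned average plays the role of ᾱ_T in the algorithm and would be used downstream for posterior mean predictions of the form k(x)^T ᾱ_T over the training inputs.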
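
For the Dataset Splits row, a hypothetical sketch of the 90%/10% protocol as described; the helper name train_test_splits and the seed handling are assumptions, not the authors' split code.

```python
# Hypothetical sketch of the 90%-train / 10%-test split protocol from the
# Dataset Splits row; the authors' repository may generate splits differently.
import numpy as np


def train_test_splits(num_points, num_splits=5, train_frac=0.9, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(num_splits):
        perm = rng.permutation(num_points)
        cut = int(train_frac * num_points)
        yield perm[:cut], perm[cut:]  # (train indices, test indices)


# Reported numbers are the mean and standard error of a metric over the splits:
# rmses = [run(train_idx, test_idx) for train_idx, test_idx in train_test_splits(n)]
# mean, stderr = np.mean(rmses), np.std(rmses, ddof=1) / np.sqrt(len(rmses))
```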
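
The CG baseline in the Software Dependencies row solves (K + λI)α = b with conjugate gradients. A bare-bones sketch using jax.scipy.sparse.linalg.cg is below, with the tolerance and iteration cap from the Experiment Setup row; the pivoted-Cholesky preconditioner that the authors compute with TensorFlow Probability is omitted here and would be supplied through the M argument of cg.

```python
# Bare-bones CG solve of (K + lam * I) alpha = b using jax.scipy; the authors'
# pivoted-Cholesky preconditioner is omitted and would be passed via `M=`.
from jax.scipy.sparse.linalg import cg


def cg_solve(K, b, lam, tol=0.01, maxiter=1000):
    def matvec(alpha):
        # Matrix-vector product with (K + lam * I) without forming the shifted matrix.
        return K @ alpha + lam * alpha

    alpha, _ = cg(matvec, b, tol=tol, maxiter=maxiter)
    return alpha
```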
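
Finally, the Experiment Setup row fixes a Matérn-3/2 kernel for all methods. The sketch below is the standard form of that kernel in JAX; the lengthscale and variance defaults are placeholders rather than the per-dataset maximum-likelihood values of Lin et al. (2023).

```python
# Standard Matérn-3/2 kernel, k(x, y) = s^2 * (1 + sqrt(3) d) * exp(-sqrt(3) d)
# with d = ||(x - y) / lengthscale||; hyperparameter values here are placeholders.
import jax
import jax.numpy as jnp


def matern32(x, y, lengthscale=1.0, variance=1.0):
    d = jnp.sqrt(jnp.sum(((x - y) / lengthscale) ** 2) + 1e-12)  # jitter keeps gradients finite at x == y
    z = jnp.sqrt(3.0) * d
    return variance * (1.0 + z) * jnp.exp(-z)


def kernel_matrix(X, Y, **kwargs):
    # Pairwise kernel matrix via nested vmap over the rows of X and Y.
    return jax.vmap(lambda x: jax.vmap(lambda y: matern32(x, y, **kwargs))(Y))(X)
```

A dense K = kernel_matrix(X, X) built this way is the kind of input assumed by the sdd_solve and cg_solve sketches above.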