Stochastic Gradient Descent for Gaussian Processes Done Right
Authors: Jihao Andreas Lin, Shreyas Padhy, Javier Antorán, Austin Tripp, Alexander Terenin, Csaba Szepesvári, José Miguel Hernández-Lobato, David Janz
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present three experiments that confirm the strength of our SDD algorithm on standard benchmarks. The first two experiments, on UCI regression and large-scale Bayesian optimisation, replicate those of Lin et al. (2023), and compare against SGD (Lin et al., 2023), CG (Wang et al., 2019) and SVGP (Hensman et al., 2013). Unless indicated otherwise, we use the code, setup and hyperparameters of Lin et al. (2023). Our third experiment tests stochastic dual descent on five molecule-protein binding affinity prediction benchmarks of García-Ortegón et al. (2022). |
| Researcher Affiliation | Academia | ¹University of Cambridge, ²Max Planck Institute for Intelligent Systems, ³Cornell University, ⁴University of Alberta |
| Pseudocode | Yes | Algorithm 1: Stochastic dual descent for approximating α*(b) = (K + λI)⁻¹b. Require: kernel matrix K with rows K_1, …, K_n ∈ ℝⁿ, targets b ∈ ℝⁿ, likelihood variance λ > 0, number of steps T ∈ ℕ₊, batch size B ∈ {1, …, n}, step size β > 0, momentum parameter ρ ∈ [0, 1), averaging parameter r ∈ (0, 1]. 1: v_0 = 0; α_0 = 0; ᾱ_0 = 0 (all in ℝⁿ) 2: for t ∈ {1, …, T} do 3: sample I_t = (i_t^1, …, i_t^B) ∼ Uniform({1, …, n}) ▷ independent random coordinates 4: g_t = (n/B) Σ_{i ∈ I_t} ((K_i + λe_i)ᵀ(α_{t−1} + ρv_{t−1}) − b_i) e_i ▷ gradient estimate 5: v_t = ρv_{t−1} − βg_t ▷ velocity update 6: α_t = α_{t−1} + v_t ▷ parameter update 7: ᾱ_t = rα_t + (1 − r)ᾱ_{t−1} ▷ iterate averaging 8: return ᾱ_T. (A runnable sketch of this procedure appears below the table.) |
| Open Source Code | Yes | Code available at: https://github.com/cambridge-mlg/sgd-gp |
| Open Datasets | Yes | We benchmark on the 9 UCI repository regression data sets (Dua and Graff, 2017) used by Wang et al. (2019) and Lin et al. (2023). We use the DOCKSTRING regression benchmark of García-Ortegón et al. (2022). |
| Dataset Splits | Yes | We report mean values and standard error across five 90%-train 10%-test splits for all data sets, except the largest, where three splits are used. For each protein, we use a standard train-test split of 210k and 40k molecules, respectively. |
| Hardware Specification | Yes | Table 1: Root-mean-square error (RMSE), compute time (on an A100 GPU), and negative log-likelihood (NLL), for 9 UCI regression tasks for all methods considered. |
| Software Dependencies | Yes | We use the implementation and hyperparameter configurations from Lin et al. (2023) for the SGD, SVGP and CG baselines, which uses the jax library (Bradbury et al., 2018). Some more detail on the implementations is as follows. CG: We implement CG using the jax.scipy library, with a pivoted Cholesky preconditioner of size 100 computed using the TensorFlow Probability library. SVGP: We initialise inducing points using k-means and k-means++. We then optimise all variational parameters, including inducing point locations, by maximising the ELBO with the Adam optimiser until convergence, using the GPJax library (Pinder and Dodd, 2022). |
| Experiment Setup | Yes | Unless indicated otherwise, we use the code, setup and hyperparameters of Lin et al. (2023). Our third experiment tests stochastic dual descent on five molecule-protein binding affinity prediction benchmarks of García-Ortegón et al. (2022). For all methods, we use a Matérn-3/2 kernel with a fixed set of hyperparameters chosen by Lin et al. (2023) via maximum likelihood. SGD uses βn = 0.5 to estimate the mean function weights, and βn = 0.1 to draw samples. For SDD, we use step-sizes βn which are 100× larger, except for ELEVATORS, KEGGDIR and BUZZ, where this causes divergence; there, we use a 10× larger step-size instead. SDD and SGD are both run for 100k steps with batch size B = 512 for both the mean function and posterior samples. For CG, we use a maximum of 1000 steps for data sets with N ≤ 50k, and a tolerance of 0.01. On the four largest data sets, the per-step cost of CG is too large to run 1000 steps, and we run 100 steps instead. For SVGP, the number of inducing points is chosen such that the compute cost approximately matches that of other methods: 3000 for the smaller five data sets and 9000 for the larger four. (An illustrative summary of these settings also appears below the table.) |
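To make the quoted pseudocode concrete, here is a minimal JAX sketch of Algorithm 1, written directly from the steps in the Pseudocode row. The function name `sdd_solve`, the default hyperparameter values, and the overall structure are our own illustrative choices, not the authors' implementation; the released repository (https://github.com/cambridge-mlg/sgd-gp) is the authoritative reference.

```python
# Minimal sketch of stochastic dual descent (Algorithm 1), assuming a dense kernel
# matrix K that fits in memory. Names and defaults are illustrative, not the authors'.
import jax
import jax.numpy as jnp


def sdd_solve(K, b, lam, num_steps=100_000, batch_size=512,
              step_size=1e-4, momentum=0.9, averaging=1e-3, seed=0):
    """Approximate alpha*(b) = (K + lam*I)^{-1} b by stochastic dual descent."""
    n = b.shape[0]

    def step(carry, key):
        v, alpha, alpha_bar = carry
        # Line 3: sample B coordinates uniformly at random (with replacement).
        idx = jax.random.randint(key, (batch_size,), 0, n)
        # Line 4: gradient estimate at the momentum look-ahead point alpha + rho*v.
        look_ahead = alpha + momentum * v
        residual = K[idx] @ look_ahead + lam * look_ahead[idx] - b[idx]
        # Scatter the sampled coordinates back into R^n, rescaled by n/B.
        g = jnp.zeros(n).at[idx].add((n / batch_size) * residual)
        # Lines 5-6: velocity and parameter updates.
        v = momentum * v - step_size * g
        alpha = alpha + v
        # Line 7: geometric iterate averaging.
        alpha_bar = averaging * alpha + (1.0 - averaging) * alpha_bar
        return (v, alpha, alpha_bar), None

    init = (jnp.zeros(n), jnp.zeros(n), jnp.zeros(n))
    keys = jax.random.split(jax.random.PRNGKey(seed), num_steps)
    (_, _, alpha_bar), _ = jax.lax.scan(step, init, keys)
    return alpha_bar  # Line 8: the averaged iterate is returned, not the last one.
```

The n/B rescaling keeps the minibatch coordinate estimate unbiased for the full dual gradient (K + λI)α − b, and the averaged iterate ᾱ_T, rather than the final iterate α_T, is used as the solution, as in the pseudocode.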
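For quick reference, the UCI experiment settings quoted in the table can be gathered into a single configuration. The dictionary below is only an illustrative summary with hypothetical key names; the values are those reported above, and the exact settings should be taken from the released code.

```python
# Illustrative summary of the UCI experiment settings quoted in the table.
# Key names are hypothetical; values follow the Experiment Setup and Software rows.
experiment_config = {
    "kernel": "Matern-3/2",  # hyperparameters fixed via maximum likelihood (Lin et al., 2023)
    "sgd": {
        "step_size_times_n": {"mean": 0.5, "samples": 0.1},
        "num_steps": 100_000,
        "batch_size": 512,
    },
    "sdd": {
        "step_size_multiplier": 100,  # 10 on ELEVATORS, KEGGDIR, BUZZ to avoid divergence
        "num_steps": 100_000,
        "batch_size": 512,
    },
    "cg": {
        "max_steps": 1000,            # 100 on the four largest data sets
        "tolerance": 0.01,
        "preconditioner_rank": 100,   # pivoted Cholesky, per the Software Dependencies row
    },
    "svgp": {"num_inducing": {"smaller_five": 3000, "larger_four": 9000}},
}
```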