Variance Reduction and Quasi-Newton for Particle-Based Variational Inference

Authors: Michael Zhu, Chang Liu, Jun Zhu

ICML 2020

Reproducibility variables, results, and LLM responses:
Research Type: Experimental. Experimental results demonstrate the accelerated convergence of variance reduction and quasi-Newton methods for ParVIs for accurate posterior inference in large-scale and ill-conditioned problems. We explore this question in the context of Bayesian linear regression and logistic regression, two fundamental real-world inference tasks. We conduct a careful empirical inspection of the sample quality of particles produced by ParVIs under various metrics, including mean squared error for estimating posterior mean and covariance, maximum mean discrepancy (Gretton et al., 2012), and kernel Stein discrepancy (Chwialkowski et al., 2016; Liu et al., 2016).
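To make one of these metrics concrete, below is a minimal NumPy sketch of a squared maximum mean discrepancy estimate between ParVI particles and reference samples under an RBF kernel. The biased V-statistic form, the median-heuristic bandwidth, and the stand-in data are illustrative assumptions, not necessarily the choices made in the paper; kernel Stein discrepancy would additionally require the score function of the target.

```python
import numpy as np

def mmd2_rbf(X, Y, bandwidth=None):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y
    under an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    Z = np.concatenate([X, Y], axis=0)
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        # Median heuristic (an assumption; the paper may use a different bandwidth).
        bandwidth = np.sqrt(0.5 * np.median(sq_dists[sq_dists > 0]))
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    n = len(X)
    K_xx, K_yy, K_xy = K[:n, :n], K[n:, n:], K[:n, n:]
    return K_xx.mean() + K_yy.mean() - 2.0 * K_xy.mean()

# Example: ParVI particles vs. reference (e.g., MCMC) samples, both stand-ins here.
rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 2))
reference = rng.normal(size=(1000, 2))
print(mmd2_rbf(particles, reference))
```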
Researcher Affiliation: Collaboration. (1) Department of Computer Science, Stanford University, Stanford, CA, USA; (2) Microsoft Research Asia, Beijing, 100080, China; (3) Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch ML Center, Tsinghua University, Beijing, 100084, China. Correspondence to: J. Zhu <dcszj@tsinghua.edu.cn>, Chang Liu <changliu@microsoft.com>, Michael H. Zhu <mhzhu@cs.stanford.edu>.
Pseudocode: Yes. Algorithm 1: Stochastic Variance Reduced Gradient (SVRG) for ParVIs; Algorithm 2: Stochastic Quasi-Newton with Variance Reduction (SQN-VR) for ParVIs (simplified under the pairwise-close approximation).
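The paper's Algorithm 1 is not reproduced here, but the core idea of plugging an SVRG-style gradient estimator into a ParVI update can be illustrated with a short sketch. Everything below (the 1-D Gaussian toy model, the fixed-bandwidth RBF kernel, the SVGD update rule, and the step sizes) is an assumption made for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration): unknown mean theta with prior
# N(theta | 0, 10) and observations y_i ~ N(theta, 1).
data = rng.normal(loc=1.5, scale=1.0, size=500)
N = len(data)

def grad_log_prior(theta):                 # d/dtheta log N(theta | 0, 10)
    return -theta / 10.0

def grad_log_lik(theta, batch):            # sum over batch of d/dtheta log N(y | theta, 1)
    return np.sum(batch[None, :] - theta[:, None], axis=1)

def svgd_direction(particles, grads, bandwidth=1.0):
    """SVGD update direction with a fixed-bandwidth RBF kernel (1-D particles)."""
    diffs = particles[:, None] - particles[None, :]        # diffs[l, j] = x_l - x_j
    K = np.exp(-diffs ** 2 / (2 * bandwidth ** 2))
    grad_K = -diffs / bandwidth ** 2 * K                    # d k(x_l, x_j) / d x_l
    return (K @ grads + grad_K.sum(axis=0)) / len(particles)

M, batch_size, step = 100, 10, 1e-3
particles = rng.normal(size=M)              # initialize from a standard Gaussian

for epoch in range(50):
    # SVRG snapshot: freeze the particles and compute the full-data gradient once.
    snapshot = particles.copy()
    full_grad_snap = grad_log_prior(snapshot) + grad_log_lik(snapshot, data)
    for _ in range(N // batch_size):
        batch = rng.choice(data, size=batch_size, replace=False)
        scale = N / batch_size
        # Variance-reduced estimate of the log-posterior gradient at each particle:
        # minibatch gradient at the current particles, corrected by the snapshot.
        g = (grad_log_prior(particles) + scale * grad_log_lik(particles, batch)
             - (grad_log_prior(snapshot) + scale * grad_log_lik(snapshot, batch))
             + full_grad_snap)
        particles = particles + step * svgd_direction(particles, g)

print("estimated posterior mean:", particles.mean())
```

The key point is that the minibatch gradient at the current particles is corrected by the difference between the minibatch and full-data gradients at the snapshot particles, which removes most of the subsampling noise while the particles remain close to the snapshot.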
Open Source Code: No. The paper does not provide a statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets: Yes. We run experiments on 8 UCI regression datasets (Dua & Graff, 2019). Our MNIST (LeCun et al., 1998) binary classification problem is classifying digits 7 vs. 9 after applying PCA to reduce the dimension of the images to 50, similar to Korattikara et al. (2014).
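As a rough illustration of the MNIST preprocessing described above, the following sketch selects digits 7 and 9 and projects the images onto 50 PCA components with scikit-learn. Details such as whether PCA is fit on the training split only, or whether pixels are rescaled first, are not specified here and are assumptions.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Load MNIST and keep only digits 7 and 9 for the binary classification task.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target
mask = np.isin(y, ["7", "9"])
X, y = X[mask], (y[mask] == "9").astype(int)

# Project the 784-dimensional images down to 50 PCA components.
X_pca = PCA(n_components=50).fit_transform(X)
print(X_pca.shape)  # (n_samples, 50)
```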
Dataset Splits: No. The paper explicitly mentions only an "80% training split" for the covtype dataset, without specifying how the remainder is divided between testing and validation, and it does not provide train/validation/test splits for the other datasets used.
Hardware Specification: Yes. For example, on an Intel Xeon E5-2640 v3, 100 passes over the covtype dataset took 20 minutes for SVRG and SPIDER, 22 for SQN-VR, 24 for SGD, and 28 for AdaGrad.
Software Dependencies: No. The paper mentions PyStan as the implementation used to obtain ground-truth MCMC samples but does not provide a version number for it or for any other software dependency crucial for reproducing the experiments.
Experiment Setup: Yes. We use 100 particles and a batch size of 10 in all of our experiments. We initialize the particles from a standard Gaussian, corresponding to the prior. For every optimizer, we tune the learning rate by running a grid search over {10^k / N : k = -1, ..., 2}, where N is the number of data points. For AdaGrad, we additionally tune the learning rate in {10^k / N : k = 3, ..., 5}, α ∈ {0.9, 0.95, 0.99, 0.999}, and the fudge factor ε ∈ {10^-k : k = 4, ..., 8}. For SGD, we decay the learning rate after each epoch according to the formula ε_t = a / (t + b)^β, where the power β ∈ {0.55, 0.75, 0.95} and the constants a and b are chosen so that the total learning rate decay over the total number of epochs is in {1, 3, 10, 30, 100, 300, 1000}. For SVRG and SPIDER, we use a constant learning rate for the first half of the run and decay the learning rate in the second half by a factor in {1, 3, 10, 30, 100, 300, 1000}. For SQN-VR, we use a constant learning rate in {10^k, 3·10^k : k = -5, ..., 0} for the quasi-Newton updates and a memory size of 10. For all of the variance reduction methods, we update the full gradient over the entire dataset after each epoch, and we first run 10 epochs of SGD.
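For the SGD schedule ε_t = a / (t + b)^β quoted above, the constants a and b follow from the chosen power β, the initial learning rate, and the desired total decay factor over the run. The sketch below shows one way to solve for them, assuming ε_t is the per-epoch learning rate and the decay factor is strictly greater than 1 (a factor of 1 simply means a constant learning rate).

```python
def sgd_decay_constants(eps0, total_epochs, total_decay, beta):
    """Return (a, b) such that eps_t = a / (t + b) ** beta equals eps0 at t = 0
    and has decayed by a factor of `total_decay` at t = total_epochs.

    From eps_0 / eps_T = ((T + b) / b) ** beta = D:
        b = T / (D ** (1 / beta) - 1),   a = eps0 * b ** beta.
    Assumes total_decay > 1 (a factor of 1 is just a constant learning rate).
    """
    b = total_epochs / (total_decay ** (1.0 / beta) - 1.0)
    a = eps0 * b ** beta
    return a, b

# Example: power beta = 0.75 with a total decay of 100x over 100 epochs.
beta = 0.75
a, b = sgd_decay_constants(eps0=1e-3, total_epochs=100, total_decay=100, beta=beta)
print(a / (0 + b) ** beta, a / (100 + b) ** beta)  # ~1e-3 and ~1e-5
```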