Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

Authors: Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose a new class of sketching methods, called surrogate sketches, which allow us to debias local estimates of the Newton step, thereby making distributed second order optimization more scalable. In our analysis of the surrogate sketches, we exploit recent developments in determinantal point processes (DPPs, [16]) to give exact formulas for the bias of the estimates produced with those sketches, enabling us to correct that bias. Due to algorithmic advances in DPP sampling, surrogate sketches can be implemented in time nearly linear in the size of the data, when the number of data points is much larger than their dimensionality, so our results lead to direct improvements in the time complexity of distributed second order optimization. Remarkably, our analysis of the bias of surrogate sketches leads to a simple technique for debiasing the local Newton estimates for ℓ2-regularized problems, which we call scaled regularization. We show that the regularizer used on the sketched data should be scaled down compared to the global regularizer, and we give an explicit formula for that scaling. Our empirical results demonstrate that scaled regularization significantly reduces the bias of local Newton estimates not only for surrogate sketches, but also for a range of other sketching techniques. (A hedged code sketch of this scaled-regularization debiasing follows the table.)
Researcher Affiliation | Academia | Michał Dereziński, Department of Statistics, University of California, Berkeley; Burak Bartan, Department of Electrical Engineering, Stanford University; Mert Pilanci, Department of Electrical Engineering, Stanford University; Michael W. Mahoney, ICSI and Department of Statistics, University of California, Berkeley
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to code repositories for the described methodology.
Open Datasets | Yes | Figures 2 and 4 show the estimation error as a function of the number of averaged outputs for the regularized least squares problem discussed in Section 1.1, on the CIFAR-10 and Boston housing prices datasets, respectively. Figure 3 shows the estimation error against iterations for the distributed Newton sketch algorithm running on a logistic regression problem with ℓ2 regularization on three different binary classification UCI datasets.
Dataset Splits | No | The paper does not provide specific training, validation, and test dataset splits (e.g., percentages or exact counts) or refer to standard splits with citations. While it mentions parameters for a backtracking line search in Figure 3, it does not define dataset splits.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The step size for distributed Newton sketch updates was determined via backtracking line search with parameters 2, c = 0.1, and a0 = 1. In all the experiments there are q = 100 workers and λ = 10⁻⁴. The dimensions of the datasets are (690 × 14), (699 × 9), and (351 × 33), and the sketch sizes are m = 50, 50, 100 for plots a, b, c, respectively. We used λ = 10, λ0 = 4.06, and sketch size m = 50. (A hedged reconstruction of the Newton-sketch setup follows the table.)
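
For context on the scaled-regularization idea described in the abstract quoted above, the following is a minimal sketch of the distributed averaging setup from Section 1.1: each simulated worker solves a sketched ridge problem and the local solutions are averaged. Uniform row sampling stands in here for the paper's surrogate (DPP-based) sketch, and the scaling rule λ' = λ(1 − d_λ/m), with d_λ the λ-effective dimension, is recalled from the paper but should be treated as an assumption rather than a verified transcription; the dimensions and regularizer are made up for illustration.

```python
# Hedged sketch: averaging sketched ridge solutions with naive vs. scaled regularization.
import numpy as np

rng = np.random.default_rng(0)
n, d, q, m = 2000, 50, 100, 100        # data points, features, workers, sketch size
lam = 10.0                             # global regularizer (illustrative value)

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Exact solution of the global problem  min_x ||Ax - b||^2 + lam * ||x||^2.
M = A.T @ A
x_star = np.linalg.solve(M + lam * np.eye(d), A.T @ b)

# lambda-effective dimension and the scaled local regularizer (assumed formula).
d_lam = np.trace(np.linalg.solve(M + lam * np.eye(d), M))
lam_scaled = lam * (1.0 - d_lam / m)

def local_estimate(lam_local):
    """One worker: sample m rows uniformly (rescaled so the sketched Gram
    matrix is unbiased) and solve the local ridge problem."""
    idx = rng.choice(n, size=m, replace=False)
    SA, Sb = A[idx] * np.sqrt(n / m), b[idx] * np.sqrt(n / m)
    return np.linalg.solve(SA.T @ SA + lam_local * np.eye(d), SA.T @ Sb)

for name, lam_local in [("naive regularizer lam", lam),
                        ("scaled regularizer lam'", lam_scaled)]:
    avg = np.mean([local_estimate(lam_local) for _ in range(q)], axis=0)
    err = np.linalg.norm(avg - x_star) / np.linalg.norm(x_star)
    print(f"{name}: relative error of the averaged estimate = {err:.3e}")
```

The comparison mirrors the paper's claim: with the naive regularizer the averaged estimate plateaus at a biased point as more local outputs are averaged, whereas the scaled-down regularizer is designed to remove that bias (exactly so for surrogate sketches, approximately for other sketching techniques).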
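
The Experiment Setup row describes distributed Newton sketch runs on ℓ2-regularized logistic regression. Below is a hedged reconstruction of one such run on synthetic data, using the quoted line-search constants c = 0.1 and a0 = 1; interpreting the parameter "2" as a step-halving factor, using uniform row sampling in place of the surrogate sketch, and the synthetic data itself are all assumptions, and lam_local marks where the paper's scaled regularization would be plugged in.

```python
# Hedged sketch: distributed Newton sketch for l2-regularized logistic regression
# with averaging over q workers and a backtracking line search (c = 0.1, a0 = 1).
import numpy as np

rng = np.random.default_rng(1)
n, d, q, m, lam = 2000, 30, 100, 50, 1e-4   # q = 100 workers, lam = 1e-4 as in the quoted setup

A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(x):
    # sum_i log(1 + exp(-y_i a_i^T x)) + (lam/2) ||x||^2
    return np.sum(np.logaddexp(0.0, -y * (A @ x))) + 0.5 * lam * (x @ x)

def gradient(x):
    return A.T @ (-y * sigmoid(-y * (A @ x))) + lam * x

def newton_sketch_step(x, lam_local):
    g = gradient(x)
    p = sigmoid(A @ x)                                  # logistic Hessian weights p * (1 - p)
    A_sqrt = np.sqrt(p * (1.0 - p))[:, None] * A        # square root of the data part of the Hessian
    directions = []
    for _ in range(q):                                  # each simulated worker sketches A_sqrt
        idx = rng.choice(n, size=m, replace=False)
        SA = A_sqrt[idx] * np.sqrt(n / m)
        H_hat = SA.T @ SA + lam_local * np.eye(d)
        directions.append(np.linalg.solve(H_hat, -g))   # local approximate Newton direction
    delta = np.mean(directions, axis=0)                 # average the local directions
    # Backtracking line search: start at a0 = 1, halve until the Armijo condition with c = 0.1 holds.
    alpha, c = 1.0, 0.1
    for _ in range(50):
        if loss(x + alpha * delta) <= loss(x) + c * alpha * (g @ delta):
            break
        alpha /= 2.0
    return x + alpha * delta

x = np.zeros(d)
for it in range(10):
    x = newton_sketch_step(x, lam_local=lam)            # lam_local is where scaled regularization goes
    print(f"iteration {it + 1}: objective = {loss(x):.4f}")
```

Passing a scaled-down value for lam_local (as in the first sketch) rather than the global lam is the paper's proposed debiasing of the averaged Newton directions.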