A Generic Acceleration Framework for Stochastic Composite Optimization

Authors: Andrei Kulunchakov, Julien Mairal

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we perform numerical evaluations by following [28], which was notably able to make SVRG and SAGA robust to stochastic noise, and to accelerate SVRG. Code to reproduce the experiments is provided with the submission, and more details and experiments are given in Appendix E. [...] Experiments and conclusions. We run each experiment five times with a different random seed and average the results. All curves also display one standard deviation. Appendix E contains numerous experiments, where we vary the amount of noise, the type of approach (SVRG vs. SAGA), the amount of regularization µ, and the choice of loss function. In Figure 1, we show a subset of these curves.
Researcher Affiliation | Academia | Andrei Kulunchakov and Julien Mairal, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Pseudocode | Yes | Algorithm 1: Generic Acceleration Framework with Exact Minimization of h_k [...] Algorithm 2: Generic Acceleration Framework with Inexact Minimization of h_k
Open Source Code | Yes | The code used for our experiments is available here: [...]
Open Datasets | Yes | alpha is from the Pascal Large Scale Learning Challenge website and contains n = 250 000 points in dimension p = 500. gene consists of gene expression data, and the binary labels b_i characterize two different types of breast cancer. This is a small dataset with n = 295 and p = 8 141. ckn-cifar is an image classification task where each image from the CIFAR-10 dataset is represented by using a two-layer unsupervised convolutional neural network [36]. We consider here the binary classification task consisting of predicting the class 1 vs. other classes, and use our algorithms for the classification layer of the network, which is convex. The dataset contains n = 50 000 images and the dimension of the representation is p = 9 216.
Dataset Splits | No | The paper describes the datasets used and mentions total counts, but it does not specify explicit train/validation/test splits, percentages, or how cross-validation was performed for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their exact versions) used for the experiments.
Experiment Setup | Yes | Formulations. Given training data (a_i, b_i)_{i=1,...,n}, with a_i in R^p and b_i in {−1, +1}, we consider the optimization problem $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} \phi(b_i a_i^\top x) + \frac{\mu}{2}\|x\|_2^2$, where φ is either the logistic loss $\phi(u) = \log(1 + e^{-u})$ or the squared hinge loss $\phi(u) = \frac{1}{2}\max(0, 1 - u)^2$, which are both L-smooth, with L = 0.25 for the logistic loss and L = 1 for the squared hinge loss. The regularization parameter µ acts as the strong convexity constant of the problem and is chosen among the smallest values one would try when performing parameter search, e.g., by cross-validation. Specifically, we consider µ = 1/(10n) and µ = 1/(100n), where n is the number of training points; we also try µ = 1/(1000n) to evaluate the numerical stability of the methods on very ill-conditioned problems. Following [7, 28, 54], we consider DropOut perturbations [51], that is, setting each component $(\nabla f(x))_i$ to 0 with probability δ and to $(\nabla f(x))_i/(1 - \delta)$ otherwise. This procedure is motivated by the need for a simple optimization benchmark illustrating stochastic finite-sum problems, where the amount of perturbation is easy to control. The settings used in our experiments are δ = 0 (no noise) and δ ∈ {0.01, 0.1}. [...] In all setups, we choose the parameter κ according to the theory described in the previous section, following Catalyst [33]. For composite problems, Proposition 5 suggests using x_{k−1} as a warm start for the inner-loop problems. For smooth ones, [33] shows that other choices, such as y_{k−1}, are in fact appropriate and lead to similar complexity results. In our experiments with smooth losses, we use y_{k−1}, which has been shown to perform consistently better. The strategy for η_k discussed in Proposition 6 suggests using constant step sizes in the inner loop for a while, typically of order 1/(κ + L) for the methods we consider, before switching to an exponentially decreasing schedule. Unfortunately, even though theory suggests a rate of decay of $(1 - q/2)^k$, it does not provide useful insight on when the decay should start, since the theoretical time requires knowing σ². A similar issue arises in stochastic optimization techniques involving iterate averaging [9]. We adopt a similar heuristic as in this literature and start decaying after k_0 epochs, with k_0 = 30. Finally, we discuss the number of iterations of M to perform in the inner loop. When η_k = 1, the theoretical value is of order O(1/τ) = O(n), and we choose exactly n iterations (one epoch), as in Catalyst [33]. After we start decaying the step sizes (η_k < 1), we use $\lceil n/\eta_k \rceil$, according to theory.
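
To make the quoted formulation concrete, the following Python/NumPy sketch implements the gradients of the regularized finite-sum objective (logistic and squared hinge losses) and the DropOut-style gradient perturbation described in the Experiment Setup row. This is an illustration under our own naming, not the authors' released code; the perturbation simply zeroes each gradient coordinate with probability δ and rescales the survivors by 1/(1 − δ), which keeps the estimator unbiased.

```python
import numpy as np

def logistic_loss_grad(x, A, b, mu):
    """Full gradient of (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (mu/2) ||x||^2."""
    n = A.shape[0]
    z = b * (A @ x)                      # margins b_i a_i^T x
    sigma = 1.0 / (1.0 + np.exp(z))      # equals -phi'(z) for the logistic loss
    return -(A.T @ (b * sigma)) / n + mu * x

def squared_hinge_grad(x, A, b, mu):
    """Full gradient of (1/n) sum_i 0.5 * max(0, 1 - b_i a_i^T x)^2 + (mu/2) ||x||^2."""
    n = A.shape[0]
    z = b * (A @ x)
    slack = np.maximum(0.0, 1.0 - z)     # hinge slack
    return -(A.T @ (b * slack)) / n + mu * x

def dropout_perturb(grad, delta, rng):
    """Zero each coordinate w.p. delta, rescale survivors by 1/(1 - delta) (unbiased)."""
    if delta == 0.0:
        return grad
    mask = rng.random(grad.shape) >= delta
    return grad * mask / (1.0 - delta)

# Toy usage with synthetic data (the real datasets are alpha, gene, ckn-cifar).
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))
b = np.sign(rng.standard_normal(n))
x = np.zeros(p)
mu = 1.0 / (10 * n)
g = dropout_perturb(logistic_loss_grad(x, A, b, mu), delta=0.1, rng=rng)
```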
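The paper's Algorithms 1 and 2 are only named in this report, so the next sketch shows a generic Catalyst-style outer loop in the spirit of the setup quoted above, under one plausible reading of the schedule: the factor η_k equals 1 for the first k_0 = 30 outer iterations and then decays by (1 − q/2) per iteration with q = µ/(µ + κ); the inner solver is a plain perturbed-gradient stand-in for SVRG/SAGA run for ⌈n/η_k⌉ steps with step size of order η_k/(κ + L); the warm start is y_{k−1}. All function names are hypothetical and details may differ from the authors' implementation.

```python
import math
import numpy as np

def accelerated_outer_loop(grad_oracle, x0, n, mu, L, kappa,
                           n_outer=100, k0=30, rng=None):
    """Catalyst-style acceleration sketch (not the paper's exact Algorithm 1/2).

    At outer iteration k, we approximately minimize
        h_k(x) = f(x) + (kappa/2) ||x - y_{k-1}||^2
    with stochastic gradient steps (a stand-in for SVRG/SAGA),
    warm-started at y_{k-1}, and then apply Nesterov-style extrapolation.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    q = mu / (mu + kappa)
    beta = (1.0 - math.sqrt(q)) / (1.0 + math.sqrt(q))  # Catalyst momentum
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, n_outer + 1):
        # Step-size factor: constant until k0 outer iterations, then exponential decay.
        eta_k = 1.0 if k <= k0 else (1.0 - q / 2.0) ** (k - k0)
        step = eta_k / (kappa + L)
        n_inner = math.ceil(n / eta_k)        # one "epoch" when eta_k = 1
        x = y.copy()                          # warm start at y_{k-1} (smooth case)
        for _ in range(n_inner):
            g = grad_oracle(x, rng)           # stochastic (possibly perturbed) gradient of f
            g_hk = g + kappa * (x - y)        # add the gradient of the proximal term
            x -= step * g_hk
        y = x + beta * (x - x_prev)           # extrapolation step
        x_prev = x
    return x_prev
```

In practice, `grad_oracle` could sample a data point or mini-batch and apply `dropout_perturb` from the previous sketch to emulate the stochastic setting of the experiments.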
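Finally, the evaluation protocol quoted in the Research Type row (five runs with different random seeds, averaged curves, one standard deviation band) can be mimicked with a few lines of NumPy/Matplotlib. The `run_experiment` function below is a placeholder that produces a synthetic convergence curve; it is not the authors' experiment or plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def run_experiment(seed):
    """Placeholder: return the objective gap per epoch for one random seed."""
    rng = np.random.default_rng(seed)
    epochs = np.arange(1, 101)
    return np.exp(-0.05 * epochs) * (1.0 + 0.1 * rng.standard_normal(epochs.size))

curves = np.stack([run_experiment(seed) for seed in range(5)])   # five seeds
mean, std = curves.mean(axis=0), curves.std(axis=0)

epochs = np.arange(1, 101)
plt.plot(epochs, mean, label="mean over 5 seeds")
plt.fill_between(epochs, mean - std, mean + std, alpha=0.3, label="one std dev")
plt.yscale("log")
plt.xlabel("epochs")
plt.ylabel("objective gap")
plt.legend()
plt.show()
```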