Fast margin maximization via dual acceleration

Authors: Ziwei Ji, Nathan Srebro, Matus Telgarsky

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our main result is that these iterates, with a proper choice of θ_t and β_t, can maximize the margin at a rate of Õ(1/t²), whereas prior work had a rate of O(1/t) at best. The key idea is to reverse the primal-dual relationship mentioned above: those works focus on primal normalized gradient descent and show that it is equivalent to dual mirror descent, but here we start from the dual, apply Nesterov acceleration to make dual optimization faster, and then translate the dual iterates into the momentum form in eq. (1.1). Note that if our goal is just to accelerate dual optimization, then it is natural to apply Nesterov's method; however, here our goal is to accelerate (primal) margin maximization: it was unclear whether the momentum method changes the implicit bias, and our margin analysis is very different from the standard analysis of Nesterov's method. The connection between momentum in the primal and acceleration in the dual also appears to be new, and we provide it as an auxiliary contribution. We state the method in full in Algorithm 1, and its analysis in Section 3. For the sake of presentation, the preceding analyses and algorithm definitions use the exponential loss; however, they can be extended to both binary and multiclass losses with exponential tails. The multiclass extension is in fact a straightforward reduction to the binary case, and is used in most figures throughout this work. We discuss these extensions in Section 5. As an illustrative application of these fast margin maximization methods, we use them to study the evolution of the kernel given by various stages of deep network training. The main point of interest is that while these kernels do seem to generally improve during training (in terms of both margins and test errors), we provide an example where simply changing the random seed switches between preferring the final kernel and the initial kernel. These empirical results appear in Section 6. [The quoted rate claim is restated as a display after the table.]
Researcher Affiliation | Collaboration | ¹Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA; ²Toyota Technological Institute at Chicago, Chicago, Illinois, USA. Correspondence to: Ziwei Ji <ziweiji2@illinois.edu>.
Pseudocode | Yes | Algorithm 1. Input: data matrix Z ∈ R^{n×d}, step sizes (θ_t)_{t≥0}, momentum factors (β_t)_{t≥0}. Initialize: w_0 = g_{-1} = (0, …, 0) ∈ R^d, q_0 = (1/n, …, 1/n) ∈ Δ_n. For t = 0, 1, 2, …: g_t ← β_t (g_{t-1} + Z^⊤ q_t); w_{t+1} ← w_t + θ_t (g_t + Z^⊤ q_t); q_{t+1} ∝ exp(−Z w_{t+1}), with q_{t+1} ∈ Δ_n. End for. [A runnable sketch of this pseudocode appears after the table.]
Open Source Code | No | The paper does not provide an explicit statement of code release or a link to a repository for the described methodology.
Open Datasets | Yes | "The data here is linearly separable, specifically mnist digits 0 and 1." (Figure 1 caption). "Here the various margin-maximization methods from Figure 1 are run on non-separable data, specifically mnist digits 3 and 5; as such, test error and not margin are reported." (Figure 2 caption). "Still with β_t = t/(t+1), Algorithm 2 can slightly beat the batch perceptron method, which is the fastest prior algorithm in the hard-margin kernel SVM setting. (Other classical methods, such as stochastic dual coordinate ascent (Shalev-Shwartz & Zhang, 2013), are focused on the nonseparable soft-margin SVM setting.) Although we do not have a convergence analysis for Algorithm 2 with a nonzero momentum, it works well in practice, as verified on the full mnist data, shown in Figure 3." (Section 4). "As an application of these fast margin-maximization methods, we study the evolution of kernels encountered during deep network training. Specifically, consider the cifar10 dataset, which has 50,000 input images in 10 classes; a standard deep network architecture for this problem is AlexNet (Krizhevsky et al., 2012), which has convolutional, dense linear, and various nonlinear layers." (Section 6). [A toy usage sketch on separable data appears after the table.]
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits or cross-validation setup details.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | No | The paper does not provide concrete hyperparameter values, training configurations, or system-level settings in the main text.
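
The rate claim quoted in the Research Type row can be read as a convergence guarantee on the normalized margin. The display below is a sketch of that reading; it assumes the standard hard-margin notation (rows z_i = y_i x_i of Z, normalized margin γ, maximum margin γ̄), which the report itself does not quote.

```latex
% Standard normalized-margin notation (assumed here, not quoted in the report):
% z_i = y_i x_i are the rows of Z, and for an iterate w_t,
\gamma(w_t) := \min_{1 \le i \le n} \frac{\langle z_i, w_t \rangle}{\lVert w_t \rVert_2},
\qquad
\bar{\gamma} := \max_{\lVert u \rVert_2 \le 1} \; \min_{1 \le i \le n} \langle z_i, u \rangle .

% Rate comparison as stated in the quoted text:
\bar{\gamma} - \gamma(w_t) = O(1/t) \;\;\text{(prior work, at best)},
\qquad
\bar{\gamma} - \gamma(w_t) = \widetilde{O}(1/t^{2}) \;\;\text{(Algorithm 1 with suitable } \theta_t, \beta_t\text{)}.
```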
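The Pseudocode row quotes Algorithm 1; the following NumPy sketch is one possible direct transcription of it. The function name `algorithm1`, the schedule arguments `theta` and `beta`, and the max-subtraction softmax stabilization are illustrative choices, not the authors' code, and the paper's specific θ_t, β_t schedules are not reproduced here.

```python
# A minimal NumPy transcription of the Algorithm 1 pseudocode quoted above.
# Names (algorithm1, theta, beta) and the stabilization trick are illustrative.
import numpy as np

def algorithm1(Z, theta, beta, num_iters):
    """Momentum-form margin maximization on a data matrix Z with rows z_i = y_i x_i.

    Z         : (n, d) array.
    theta     : callable t -> step size theta_t.
    beta      : callable t -> momentum factor beta_t.
    num_iters : number of iterations to run.
    Returns the final primal iterate w.
    """
    n, d = Z.shape
    w = np.zeros(d)          # w_0 = 0
    g = np.zeros(d)          # g_{-1} = 0
    q = np.full(n, 1.0 / n)  # q_0 = (1/n, ..., 1/n), uniform over the n examples

    for t in range(num_iters):
        Zq = Z.T @ q                  # Z^T q_t, the dual-to-primal direction
        g = beta(t) * (g + Zq)        # g_t = beta_t (g_{t-1} + Z^T q_t)
        w = w + theta(t) * (g + Zq)   # w_{t+1} = w_t + theta_t (g_t + Z^T q_t)
        s = -Z @ w                    # q_{t+1} proportional to exp(-Z w_{t+1})
        s -= s.max()                  # subtract the max for numerical stability
        q = np.exp(s)
        q /= q.sum()                  # renormalize onto the simplex
    return w
```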
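The Open Datasets row reports experiments on mnist and cifar10; those datasets and the paper's exact settings are not reproduced here. As a usage illustration only, the snippet below runs the sketch above on a synthetic linearly separable toy problem and reports the resulting normalized margin. The β_t = t/(t+1) momentum factor is the one quoted for Algorithm 2 in Section 4 and is used here purely as a placeholder, as is the unit step size.

```python
# Illustrative usage of the algorithm1 sketch on synthetic separable data
# (not the paper's mnist 0-vs-1 experiment; schedules are placeholders).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3.0, 1.0, size=(50, 2)),
               rng.normal(-3.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
Z = y[:, None] * X                            # rows z_i = y_i x_i

w = algorithm1(Z,
               theta=lambda t: 1.0,           # placeholder step size
               beta=lambda t: t / (t + 1.0),  # t/(t+1), quoted in Section 4 for Algorithm 2
               num_iters=1000)

margin = (Z @ w).min() / np.linalg.norm(w)    # normalized margin of the final iterate
print(f"normalized margin: {margin:.4f}")
```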