Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy

Authors: Blake Woodworth, Konstantin Mishchenko, Francis Bach

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In Section 4, we conduct experiments to show the efficacy of our algorithm in realistic problems, including one with a non-convex objective. To exhibit the effectiveness of our method in practice, we conducted several experiments on realistic problems. Logistic regression: First, we consider a simple binary logistic regression problem with the mushrooms dataset, which consists of 8124 samples of dimension 112."
Researcher Affiliation | Collaboration | "1 Inria, École Normale Supérieure, PSL Research University, Paris, France. 2 Samsung AI Center, Cambridge, UK. Work done while at CNRS, École Normale Supérieure, Inria."
Pseudocode | Yes | "Algorithm 1 PROXYPROX. 1: Input: initialization w₀ ∈ ℝ^d, stepsize η > 0. 2: for k = 0, 1, ..., K − 1 do. 3: Sample g_k such that E[g_k | w_k] = ∇L(w_k). 4: Set w_{k+1} ← argmin_w φ_k(w), where φ_k(w) := ⟨g_k, w⟩ + D_L̂(w; w_k) + (1/(2η))‖w − w_k‖²." (A minimal code rendering of this update is sketched after the table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | "Logistic regression: First, we consider a simple binary logistic regression problem with the mushrooms dataset, which consists of 8124 samples of dimension 112. ResNet-18: In this experiment, summarized in Figure 2, we train a ResNet-18 network on CIFAR-10, defining L̂ using a 2560-image subset of the images, and using the full train dataset of 50000 images for L." (One plausible way to set up the proxy and full data loaders is sketched after the table.)
Dataset Splits | No | The paper mentions training on the "full train dataset of 50000 images for L" and reporting "test loss" and "test accuracy" on CIFAR-10, but it does not give explicit split percentages or sample counts for training, validation, and test sets, nor does it state whether a validation set was used or how it was constructed.
Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models or processor types) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | "In this experiment, the objective L is the logistic regression loss evaluated on the training data plus an ℓ2 regularizer with weight µ = 10⁻⁶ H, where H is the smoothness parameter of the objective. Stochastic gradients for L are calculated by evaluating the gradient on minibatches of size 256 or 1024 drawn uniformly with replacement. We compare running our method with either 20 or 40 iterations of SGD... with the stepsize tuned by grid search and ultimately set to 0.01. ...We use a minibatch size of 128 for both g_k and the SGD updates used to minimize φ_k. Our method can be better than ADAMW for the first 25 epochs... We use standard stepsizes: 0.1 for SGD, 0.001 for ADAMW, and weight decay of 0.1 for ADAMW, which gave the best test accuracy in a grid search, and all methods also used cosine annealing." (A sketch of the regularized logistic-regression gradient is given after the table.)
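The quoted pseudocode needs only one stochastic gradient of the expensive loss L per outer step, plus gradients of the cheap proxy L̂ for the inner subproblem. Below is a minimal NumPy sketch of that structure, assuming a full-gradient oracle for L̂ and plain gradient descent on the subproblem (the paper's experiments use SGD there); the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def proxyprox(grad_L_stochastic, grad_L_hat, w0, eta, num_outer, num_inner, inner_lr):
    """Sketch of the PROXYPROX outer/inner loop structure.

    grad_L_stochastic(w): stochastic gradient g_k with E[g_k | w] = grad L(w) (expensive loss).
    grad_L_hat(w):        gradient of the cheap proxy objective L_hat.
    """
    w = np.array(w0, dtype=float)
    for _ in range(num_outer):
        w_k = w.copy()
        g_k = grad_L_stochastic(w_k)     # step 3: stochastic gradient of L at w_k
        grad_hat_wk = grad_L_hat(w_k)    # reused inside the Bregman term D_{L_hat}(w; w_k)
        for _ in range(num_inner):
            # step 4 (approximate): gradient of
            # phi_k(w) = <g_k, w> + D_{L_hat}(w; w_k) + ||w - w_k||^2 / (2 * eta)
            grad_phi = g_k + grad_L_hat(w) - grad_hat_wk + (w - w_k) / eta
            w = w - inner_lr * grad_phi
    return w
```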
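For the ResNet-18 experiment, the Open Datasets row says L̂ is defined on a 2560-image subset of CIFAR-10 while L uses all 50000 training images. One plausible PyTorch/torchvision data setup is sketched below; the random choice of the subset, the bare ToTensor transform, and the batch size of 128 (taken from the Experiment Setup row) are assumptions, since the quoted excerpts do not specify the loader code.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # augmentation choices are not specified in the excerpts

# L: loss over the full 50,000-image CIFAR-10 training set.
full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)

# L_hat: loss over a fixed 2,560-image subset (the selection rule is an assumption).
proxy_idx = torch.randperm(len(full_train))[:2560].tolist()
proxy_train = Subset(full_train, proxy_idx)

full_loader = DataLoader(full_train, batch_size=128, shuffle=True)    # minibatches for g_k
proxy_loader = DataLoader(proxy_train, batch_size=128, shuffle=True)  # minibatches for the inner solver on L_hat
```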
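In the logistic-regression experiment quoted in the Experiment Setup row, L is the logistic loss on the training data plus an ℓ2 regularizer with weight µ = 10⁻⁶ H, and g_k comes from minibatches of size 256 or 1024 drawn uniformly with replacement. The sketch below writes that gradient in NumPy, assuming labels in {−1, +1} and a (µ/2)‖w‖² regularizer (the excerpts do not say whether the 1/2 factor is present).

```python
import numpy as np

def reg_logistic_grad(w, X, y, mu):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (mu/2) * ||w||^2, with y_i in {-1, +1}."""
    z = -y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))              # sigmoid(-y_i * x_i^T w)
    return -(X.T @ (y * sigma)) / len(y) + mu * w

def minibatch_grad(w, X, y, mu, batch_size, rng):
    """Stochastic gradient g_k from a minibatch (e.g. 256 or 1024) drawn uniformly with replacement."""
    idx = rng.integers(0, len(y), size=batch_size)
    return reg_logistic_grad(w, X[idx], y[idx], mu)

# Hypothetical wiring into the PROXYPROX sketch above:
# rng = np.random.default_rng(0)
# g_k = minibatch_grad(w, X_train, y_train, mu=1e-6 * H, batch_size=256, rng=rng)
```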