Asynchronous Stochastic Optimization Robust to Arbitrary Delays

Authors: Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate the robustness and efficacy of Picky SGD, we present a comparison between the performance of SGD versus Picky SGD under various delay distributions. All training is performed on the standard CIFAR-10 dataset [15] using a ResNet56 with 9 blocks model [13] and implemented in TensorFlow [1]. We compare Picky SGD (Algorithm 1) to the SGD algorithm which unconditionally updates the state x_t given the stochastic delayed gradient g_t (recall that g_t is the stochastic gradient at the state x_{t−d_t}).
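For intuition, the following hypothetical Python sketch simulates the plain delayed-SGD baseline described above, which applies every stale gradient unconditionally. The delay distribution, the toy objective, and all names are illustrative assumptions, not the authors' implementation.

import numpy as np

def delayed_sgd(grad, x0, lr=0.1, steps=200, max_delay=10, seed=0):
    """Simulate SGD that applies every delayed gradient unconditionally.

    grad(x) returns a (possibly stochastic) gradient; delays d_t are drawn
    uniformly from {0, ..., max_delay} purely for illustration.
    """
    rng = np.random.default_rng(seed)
    history = [np.array(x0, dtype=float)]    # past iterates, so x_{t-d_t} is available
    x = history[0].copy()
    for t in range(steps):
        d = min(int(rng.integers(0, max_delay + 1)), t)   # delay of this gradient
        x_stale = history[t - d]                           # iterate the gradient was computed at
        g = grad(x_stale)                                  # stale stochastic gradient g_t
        x = x - lr * g                                     # unconditional update
        history.append(x.copy())
    return x

# Toy usage on f(x) = 0.5 * ||x||^2, so grad(x) = x (plus a little noise).
noisy_grad = lambda x: x + 0.01 * np.random.randn(*x.shape)
print(delayed_sgd(noisy_grad, x0=np.ones(5)))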
Researcher Affiliation | Collaboration | Alon Cohen (Tel Aviv University and Google Research, Israel) alonco@tauex.tau.ac.il; Amit Daniely (Hebrew University of Jerusalem and Google Research, Israel) amit.daniely@mail.huji.ac.il; Yoel Drori (Google Research, Israel) dyoel@google.com; Tomer Koren (Tel Aviv University and Google Research, Israel) tkoren@tauex.tau.ac.il; Mariano Schain (Google Research, Israel) marianos@google.com
Pseudocode | Yes | Algorithm 1: Picky SGD
1: input: learning rate η, target accuracy ε.
2: for t = 1, . . . , T do
3: receive delayed stochastic gradient g_t and point x_{t−d_t} such that E_t[g_t] = ∇f(x_{t−d_t}).
4: if ‖x_t − x_{t−d_t}‖ ≤ ε/(2β) then
5: update: x_{t+1} = x_t − η g_t.
6: else
7: pass: x_{t+1} = x_t.
8: end if
9: end for
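Below is a minimal, self-contained Python sketch of the Picky SGD update rule as described in Algorithm 1. The delay model, the grad callback, and the concrete threshold value are assumptions for illustration only and do not come from the paper.

import numpy as np

def picky_sgd(grad, x0, lr=0.1, threshold=0.5, steps=200, max_delay=10, seed=0):
    """Sketch of the Picky SGD rule: apply a delayed gradient only when the
    iterate it was computed at is still close to the current iterate.

    `threshold` stands in for the theoretical quantity epsilon / (2*beta);
    in practice the paper suggests picking a high percentile of the logged
    distances ||x_t - x_{t-d_t}|| instead.
    """
    rng = np.random.default_rng(seed)
    history = [np.array(x0, dtype=float)]
    x = history[0].copy()
    skipped = 0
    for t in range(steps):
        d = min(int(rng.integers(0, max_delay + 1)), t)
        x_stale = history[t - d]            # x_{t-d_t}: point where g_t was computed
        g = grad(x_stale)                   # delayed stochastic gradient g_t
        if np.linalg.norm(x - x_stale) <= threshold:
            x = x - lr * g                  # update: the gradient is still relevant
        else:
            skipped += 1                    # pass: the gradient is too stale
        history.append(x.copy())
    return x, skipped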
Open Source Code | No | The paper does not provide an explicit statement about making the source code available or a link to a code repository for the methodology described.
Open Datasets | Yes | All training is performed on the standard CIFAR-10 dataset [15] using a ResNet56 with 9 blocks model [13] and implemented in TensorFlow [1].
Dataset Splits | No | The paper mentions optimizing meta-parameters and using a baseline learning rate schedule, implying some form of hyperparameter tuning, but it does not provide specific details on training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions "TensorFlow [1]" but does not provide specific version numbers for TensorFlow or any other ancillary software dependencies.
Experiment Setup | Yes | Batch size 64 was used throughout the experiments. Note that although we chose the threshold value by an exhaustive search, in practice a good choice can be found by logging the distance values during a typical execution and choosing a high percentile value. For both algorithms, instead of a constant learning rate we use a piecewise-linear learning rate schedule, as follows: we consider a baseline piecewise-linear learning rate schedule that achieves optimal performance in a synchronous distributed optimization setting (that is, for d_t ≡ 0) and search (for each of the four delay schedules and each algorithm, to compensate for the effects of delays) for the best multiple of the baseline rate and the best first rate-change point.
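As a rough illustration of the setup described above, here is a hedged Python sketch of a piecewise-linear learning-rate schedule with a searched rate multiple and first rate-change point, plus the percentile-based threshold heuristic. All breakpoint values, the example multiple, and the helper names are assumptions, not the paper's actual configuration.

import numpy as np

def piecewise_linear_lr(step, breakpoints):
    """Linearly interpolate the learning rate between (step, rate) breakpoints."""
    steps, rates = zip(*breakpoints)
    return float(np.interp(step, steps, rates))

def scaled_schedule(baseline, multiple, first_change):
    """Rescale a baseline schedule and move its first rate-change point.

    `multiple` rescales every rate; `first_change` overrides the step at which
    the first rate change happens (the two quantities searched per delay schedule).
    """
    (s0, r0), (s1, r1), *rest = baseline
    shifted = [(s0, r0 * multiple), (first_change, r1 * multiple)]
    shifted += [(s, r * multiple) for s, r in rest]
    return shifted

def pick_threshold(logged_distances, percentile=90):
    """Heuristic from the text: use a high percentile of logged ||x_t - x_{t-d_t}||."""
    return float(np.percentile(logged_distances, percentile))

# Illustrative baseline: hold the rate, then two decays (values are assumptions).
baseline = [(0, 0.1), (20_000, 0.1), (40_000, 0.01), (60_000, 0.001)]
tuned = scaled_schedule(baseline, multiple=0.5, first_change=30_000)
print(piecewise_linear_lr(25_000, tuned))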