Delay-Adaptive Distributed Stochastic Optimization

Authors: Zhaolin Ren, Zhengyuan Zhou, Linhai Qiu, Ajay Deshpande, Jayant Kalagnanam

AAAI 2020, pp. 5503-5510

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From Section 6 (Numerical Results): "We first verify convergence of delay-adaptive DASGD on a standard Rosenbrock test function with d = 101 degrees of freedom, ..." and "We also compared the accuracies of a logistic regression model learned using the delay-adaptive vs non-delay-adaptive DASGD algorithms on the MNIST dataset (Le Cun 1998), a standard benchmark for machine learning tasks." (A sketch of the Rosenbrock test appears after the table.)
Researcher Affiliation | Collaboration | Zhaolin Ren (1), Zhengyuan Zhou (2,4), Linhai Qiu (3), Ajay Deshpande (4), Jayant Kalagnanam (4); affiliations: (1) Harvard University, (2) New York University, (3) Google Inc., (4) IBM Research
Pseudocode | Yes | Algorithm 1: Distributed Asynchronous Stochastic Gradient Descent (Require: Y_0 ∈ R^d, ...); Algorithm 2: Delay-Adaptive DASGD (Require: Y_0 ∈ R^d, ...); Algorithm 3: DASGD-T (Require: initial state X_0 ∈ R^d, step-size sequence α_n, initial truncation parameter τ_0). (A minimal update-loop sketch appears after the table.)
Open Source Code | No | The paper does not include any statement about making its source code publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | MNIST (test set): "We also compared the accuracies of a logistic regression model learned using the delay-adaptive vs non-delay-adaptive DASGD algorithms on the MNIST dataset (Le Cun 1998), a standard benchmark for machine learning tasks."
Dataset Splits | No | The paper mentions training models and reporting 'test accuracy', but does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running experiments. It mentions distributed architectures and the number of workers, but not specific hardware components.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation or experiments.
Experiment Setup | Yes | "In all the cases with different delay functions, we use the same initial condition that was generated randomly but fixed throughout all the runs. For all the cases, we drew noise from a standard multivariate Gaussian distribution when computing the gradient of each time step. We ran 10 trials and averaged the results for each case. ... In the non-delay-adaptive cases, we used the fixed step size of 1e-4 as the baseline. In the delay-adaptive cases, we use the step size 1e-4 / (n log n log log n + n^c (log(n) / log s(n))), where c = 1/2, but when the delay is less than 10, we use the non-adaptive step size of 1e-4 in order to allow faster convergence when the delay is small." (The step-size rule is sketched in code after the table.)
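
For concreteness, the convergence test quoted under Research Type and Experiment Setup can be set up as below: the standard d = 101 Rosenbrock objective with gradients perturbed by standard Gaussian noise. Only the dimension and the noise distribution come from the quoted text; the exact Rosenbrock variant, noise scale, and initialization used in the paper are assumptions of this sketch.

    import numpy as np

    def rosenbrock(x):
        # Standard chained Rosenbrock: sum_i 100*(x[i+1] - x[i]^2)^2 + (1 - x[i])^2.
        return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    def rosenbrock_grad(x):
        g = np.zeros_like(x)
        # Derivative contributions to x[i] from term i and to x[i+1] from term i.
        g[:-1] += -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
        g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
        return g

    def noisy_grad(x, rng):
        # The quoted setup adds standard multivariate Gaussian noise to each gradient.
        return rosenbrock_grad(x) + rng.standard_normal(x.shape)

    d = 101                          # degrees of freedom quoted in the paper
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(d)      # stand-in for the paper's fixed random initial condition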
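
Algorithm 2 (Delay-Adaptive DASGD) is only described in pseudocode. The single-process simulation below is a hedged sketch of the asynchronous update loop, not the authors' implementation; grad_fn, delay_fn, and step_fn are hypothetical arguments standing in for the stochastic gradient oracle, the delay process s(n), and the delay-dependent step-size rule.

    import numpy as np

    def delay_adaptive_sgd(grad_fn, x0, steps, delay_fn, step_fn, rng=None):
        # At iteration n the "master" applies a gradient that a worker computed at an
        # iterate that is delay_fn(n) updates old; step_fn(n, delay) shrinks with delay.
        rng = np.random.default_rng(0) if rng is None else rng
        x = np.asarray(x0, dtype=float).copy()
        history = [x.copy()]                        # past iterates, to emulate staleness
        for n in range(1, steps + 1):
            s = min(delay_fn(n), len(history) - 1)  # observed delay of this gradient
            g = grad_fn(history[-1 - s], rng)       # stochastic gradient at a stale iterate
            x = x - step_fn(n, s) * g
            history.append(x.copy())
        return x

Paired with the Rosenbrock oracle above (grad_fn=noisy_grad), different delay processes can be swapped in through delay_fn, and step_fn carries the delay-adaptive rule sketched next.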
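
The step-size rule quoted under Experiment Setup can be read as the following helper. Rewriting "1e 4" as 1e-4 and "nc" as n^c follows directly from the quote (c = 1/2 is stated there); treating s(n) as the delay observed at iteration n is an assumption, and the adaptive branch is only well defined once n is large enough that log(log n) is positive.

    import math

    def step_size(n, delay, base_lr=1e-4, c=0.5):
        # Fall back to the fixed (non-adaptive) step when the delay is small, as reported.
        if delay < 10:
            return base_lr
        # Delay-adaptive step: 1e-4 / (n*log(n)*log(log(n)) + n^c * log(n)/log(s(n))),
        # with c = 1/2. Requires n >= 3 so log(log(n)) > 0; delay >= 10 keeps log(delay) > 0.
        denom = n * math.log(n) * math.log(math.log(n)) + (n ** c) * (math.log(n) / math.log(delay))
        return base_lr / denom

Used as step_fn in the loop above, this gives the delay-adaptive runs; step_fn = lambda n, s: 1e-4 recovers the non-adaptive baseline.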