Toward Understanding the Impact of Staleness in Distributed Machine Learning

Authors: Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, Eric Xing

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of SGD in non-convex optimization under staleness, matching the best-known convergence rate of O(1/√T) (see the rate sketch after the table).
Researcher Affiliation | Collaboration | Apple Inc., Duke University, and Petuum Inc.
Pseudocode | No | The paper describes the Async-SGD update rule as x_{k+1} = x_k − (η_k / |ξ(τ_k)|) ∇f_{ξ(τ_k)}(x_{τ_k}), which is a mathematical expression for an algorithm step, but it is not presented in a formal 'pseudocode' or 'algorithm' block (a runnable sketch of this stale update appears after the table).
Open Source Code | No | The paper does not provide any statement about releasing the source code for the methodology or a link to a code repository.
Open Datasets | Yes | Table 1: Overview of the models, algorithms... and dataset (Krizhevsky & Hinton, 2009; Marcus et al., 1993; LeCun, 1998; Harper & Konstan, 2016; Rennie) in our study. Datasets mentioned include CIFAR-10, Penn Treebank, MNIST, 20 Newsgroups, and MovieLens 1M.
Dataset Splits | No | The paper mentions using 'test accuracy' and 'test loss' for evaluation but does not specify the proportions or methodology for train/validation/test splits (e.g., an '80/10/10' split, or whether a separate validation set was used for hyperparameter tuning).
Hardware Specification | No | The paper discusses simulation on a 'single machine' and 'distributed machine learning systems' but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for experiments.
Software Dependencies | No | The paper does not specify the versions of any programming languages, libraries, or frameworks used for implementation (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | Table 1: Overview of the models, algorithms... Key Parameters... η denotes the learning rate, which, if not specified, is tuned empirically for each algorithm and staleness level; β1, β2 are optimization hyperparameters; α, β in LDA are Dirichlet priors... We use batch size 32 for CNNs, DNNs, MLR, and VAEs... For MF, we use a batch size of 25,000 samples... For LDA we use D/(10P) as the batch size... (restated as a config sketch after the table).
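
For reference, the "best-known rate" cited in the Research Type row is the standard stationarity guarantee for SGD on smooth non-convex objectives, usually stated in terms of the average expected squared gradient norm. A generic form of that guarantee (not the paper's exact theorem or constants) is:

```latex
\min_{1 \le k \le T} \mathbb{E}\,\|\nabla f(x_k)\|^2
  \;\le\; \frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\,\|\nabla f(x_k)\|^2
  \;=\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)
```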
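
The Async-SGD update quoted in the Pseudocode row applies a mini-batch gradient evaluated at a delayed iterate x_{τ_k}. Below is a minimal NumPy sketch of SGD under simulated staleness; the toy least-squares objective, the uniform delay model, and all parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares objective f(x) = 0.5 * ||A x - b||^2 (illustrative only).
A = rng.normal(size=(256, 10))
b = rng.normal(size=256)

def minibatch_grad(x, idx):
    """Mini-batch gradient of f at x over the sampled rows idx."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def stale_sgd(steps=500, batch=32, eta=0.01, max_staleness=4):
    """SGD where each update uses a gradient computed at a delayed iterate x_{tau_k}."""
    x = np.zeros(A.shape[1])
    history = [x.copy()]                               # past iterates, so x_{tau_k} can be looked up
    for k in range(steps):
        delay = rng.integers(0, max_staleness + 1)     # staleness drawn uniformly (assumed delay model)
        tau_k = max(0, k - delay)
        idx = rng.choice(len(b), size=batch, replace=False)
        g = minibatch_grad(history[tau_k], idx)        # gradient at the stale iterate
        x = x - eta * g                                # x_{k+1} = x_k - eta_k * grad f_{xi(tau_k)}(x_{tau_k})
        history.append(x.copy())
    return x

x_hat = stale_sgd()
print("final loss:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```

Setting max_staleness=0 recovers ordinary synchronous mini-batch SGD, which makes the sketch convenient for comparing convergence with and without delay.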
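
As a compact restatement of the quoted Experiment Setup row, here is a hypothetical configuration dictionary; the keys and structure are my own and do not correspond to the authors' actual configuration files.

```python
# Hypothetical restatement of the quoted setup (Table 1 of the paper).
EXPERIMENT_SETUP = {
    "CNN": {"batch_size": 32},
    "DNN": {"batch_size": 32},
    "MLR": {"batch_size": 32},
    "VAE": {"batch_size": 32},
    "MF":  {"batch_size": 25000},
    "LDA": {"batch_size": "D / (10 * P)",        # D and P as defined in the paper
            "dirichlet_priors": ("alpha", "beta")},
    # Learning rate eta: tuned empirically for each algorithm and staleness level.
}
```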