Toward Understanding the Impact of Staleness in Distributed Machine Learning
Authors: Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, Eric Xing
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of SGD in non-convex optimization under staleness, matching the best-known convergence rate of O(1/√T). |
| Researcher Affiliation | Collaboration | Apple Inc, Duke University, and Petuum Inc |
| Pseudocode | No | The paper describes the update rule of Async-SGD as $x_{k+1} = x_k - \frac{\eta_k}{\lvert \xi(\tau_k) \rvert} \nabla f_{\xi(\tau_k)}(x_{\tau_k})$, which is a mathematical expression for an algorithm step, but it is not presented in a formal 'pseudocode' or 'algorithm' block (a minimal staleness-simulation sketch follows this table). |
| Open Source Code | No | The paper does not provide any statement about releasing the source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | Table 1: Overview of the models, algorithms... and dataset (Krizhevsky & Hinton, 2009; Marcus et al., 1993; LeCun, 1998; Harper & Konstan, 2016; Rennie) in our study. Datasets mentioned include CIFAR10, Penn Treebank, MNIST, 20 Newsgroups, and MovieLens 1M. |
| Dataset Splits | No | The paper mentions using 'test accuracy' and 'test loss' for evaluation but does not specify the proportions or methodology for train/validation/test splits (e.g., '80/10/10 split', or if a separate validation set was used for hyperparameter tuning). |
| Hardware Specification | No | The paper discusses simulation on a 'single machine' and 'distributed machine learning systems' but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for experiments. |
| Software Dependencies | No | The paper does not specify the versions of any programming languages, libraries, or frameworks used for implementation (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | Table 1: Overview of the models, algorithms... Key Parameters... η denotes the learning rate, which, if not specified, is tuned empirically for each algorithm and staleness level; β₁, β₂ are optimization hyperparameters; α, β in LDA are Dirichlet priors... We use batch size 32 for CNNs, DNNs, MLR, and VAEs... For MF, we use a batch size of 25000 samples... For LDA we use D/(10P) as the batch size... |
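
The Async-SGD update quoted in the Pseudocode row, combined with the single-machine staleness simulation noted under Hardware Specification, can be made concrete with a short sketch. This is not the authors' code: the toy least-squares objective, the uniform delay distribution, the step size, and the helper names (`grad`, `stale_sgd`) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: simulate staleness on a single machine by applying gradients
# computed at delayed iterates, in the spirit of the Async-SGD update
# x_{k+1} = x_k - (eta_k / |xi(tau_k)|) * grad f_{xi(tau_k)}(x_{tau_k}).
# Objective, delay model, and hyperparameters here are placeholders.

def grad(x, batch):
    # Gradient of a toy least-squares objective f_i(x) = 0.5 * ||x - b_i||^2,
    # averaged over the mini-batch (hence the division by the batch size).
    return np.mean(x - batch, axis=0)

def stale_sgd(data, dim, steps=1000, eta=0.1, max_staleness=4, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    history = [x.copy()]                      # past iterates x_0, ..., x_k
    for k in range(steps):
        # Draw a staleness s uniformly from {0, ..., max_staleness}; the
        # gradient is then evaluated at the delayed iterate x_{k-s}.
        s = int(rng.integers(0, max_staleness + 1))
        x_stale = history[max(0, k - s)]
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        x = x - eta * grad(x_stale, batch)
        history.append(x.copy())
    return x

# Usage: recover the mean of synthetic data despite stale gradients.
data = np.random.default_rng(1).normal(loc=3.0, size=(10_000, 5))
print(stale_sgd(data, dim=5))
```

With `max_staleness=0` this reduces to plain mini-batch SGD, so sweeping that parameter is one way to reproduce the kind of staleness-versus-convergence comparisons the paper reports.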