Bayesian Distributed Stochastic Gradient Descent

Authors: Michael Teng, Frank Wood

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but sometimes also increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. (A minimal sketch of this straggler-cutoff update appears after the table.)
Researcher Affiliation | Academia | Michael Teng, Department of Engineering Sciences, University of Oxford, mteng@robots.ox.ac.uk; Frank Wood, Department of Computer Science, University of British Columbia, fwood@cs.ubc.ca
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | Yes | To test our model's ability to accurately predict joint worker runtimes, we perform experiments by training 4 different neural network architectures on one of two clusters of different architectures and sizes. [...] train a 2-layer CNN on MNIST classification. [...] training three neural network architectures for CIFAR10 classification: a Wide ResNet model (Zagoruyko and Komodakis, 2016) and 16 and 64 layer ResNets. (A sketch of how such a runtime model can drive the cutoff choice appears after the table.)
Dataset Splits | Yes | Figure 3b shows MNIST validation loss for model-based methods, Elfving and BDSGD, compared to naive synchronous (waiting for all workers) and asynchronous (Hogwild) approaches, where dashed vertical lines indicate the time at which the final iteration completed (all training methods perform the same number of mini-batch gradient updates). [...] and validation loss at a fixed wall-clock time (set to the wall-clock time at 50% of the total training time taken by Chen et al.'s method).
Hardware Specification | Yes | On one cluster comprised of four nodes of 40 logical Intel Xeon processors, we benchmark Elfving and BDSGD cutoff predictors against the fully synchronous and fully asynchronous SGD with a 158-worker model [...] On a large compute cluster, we use 32 68-core CPU nodes of a Cray XC40 supercomputer to compare 2175-worker BDSGD runs against the Chen et al. cutoff and naive methods on training three neural network architectures for CIFAR10 classification.
Software Dependencies | No | The paper mentions software components and concepts like 'deep neural net auto-regressors', 'variational autoencoder loss', and 'Adam optimizer', but it does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | No | The paper mentions general training parameters like 'learning rate' and 'batchsize' but does not provide specific hyperparameter values or detailed system-level training settings used in their experiments.
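
The Research Type row quotes the paper's central claim: waiting only for the fastest workers and discarding the stragglers' mini-batch gradients can improve both throughput and wall-clock convergence. Below is a minimal, self-contained simulation of that straggler-cutoff update, offered as an illustration rather than the authors' implementation; the toy least-squares objective, the log-normal runtime model, and the helper names (`run`, `minibatch_grad`) are all assumptions introduced here.

```python
# Minimal simulation of the straggler-cutoff idea: at every step the server
# waits only for the fastest `cutoff` of `n_workers` mini-batch gradients and
# never waits for the stragglers' contributions.
# The toy least-squares model and the log-normal runtime model are
# illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares regression, so gradients are cheap to compute.
d = 10
w_true = rng.normal(size=d)
X = rng.normal(size=(4096, d))
y = X @ w_true + 0.1 * rng.normal(size=4096)

def minibatch_grad(w, batch):
    xb, yb = X[batch], y[batch]
    return 2.0 * xb.T @ (xb @ w - yb) / len(batch)

def run(n_workers=32, cutoff=32, steps=200, lr=0.05, batch=64):
    """Synchronous SGD that aggregates only the first `cutoff` gradients."""
    w = np.zeros(d)
    wall_clock = 0.0
    for _ in range(steps):
        # Assumed runtime model: heavy-tailed per-worker compute times.
        runtimes = rng.lognormal(mean=0.0, sigma=0.5, size=n_workers)
        fastest = np.argsort(runtimes)[:cutoff]
        # The step finishes when the slowest *included* worker finishes.
        wall_clock += runtimes[fastest].max()
        grads = [minibatch_grad(w, rng.integers(0, len(X), batch))
                 for _ in fastest]
        w -= lr * np.mean(grads, axis=0)
    return wall_clock, np.mean((X @ w - y) ** 2)

for c in (32, 24, 16):  # full synchronisation vs. two straggler cutoffs
    t, loss = run(cutoff=c)
    print(f"cutoff={c:2d}  wall-clock={t:7.1f}  final loss={loss:.4f}")
```

With heavy-tailed worker runtimes, a cutoff below the full worker count trades a slightly noisier averaged gradient for a much shorter per-step wait, which is the idleness argument quoted above.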
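
The Open Datasets row quotes the paper's stated goal of predicting joint worker runtimes; BDSGD uses such a predictive model to choose its cutoff. The sketch below shows one plausible way to turn runtime predictions into a cutoff, by maximising expected gradients delivered per unit of wall-clock time via Monte Carlo estimates of order statistics. The correlated log-normal sampler merely stands in for the paper's learned deep auto-regressive runtime model, and `sample_joint_runtimes`, `best_cutoff`, and all parameters are hypothetical; the paper's exact objective may differ.

```python
# Illustration of choosing a straggler cutoff from a predictive model of joint
# worker runtimes: pick the cutoff that maximises expected gradients delivered
# per unit of wall-clock time. A correlated log-normal sampler stands in for a
# learned runtime model; everything here is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample_joint_runtimes(n_samples, n_workers=32):
    # Stand-in for a learned runtime model: correlated log-normal runtimes
    # (a shared "cluster load" factor plus per-worker noise).
    shared = rng.normal(scale=0.3, size=(n_samples, 1))
    own = rng.normal(scale=0.4, size=(n_samples, n_workers))
    return np.exp(shared + own)

def best_cutoff(n_workers=32, n_samples=10_000):
    runtimes = np.sort(sample_joint_runtimes(n_samples, n_workers), axis=1)
    # E[time to collect the fastest c gradients] = E[c-th order statistic].
    expected_step_time = runtimes.mean(axis=0)   # index c-1 corresponds to cutoff c
    cutoffs = np.arange(1, n_workers + 1)
    throughput = cutoffs / expected_step_time    # gradients per unit time
    return cutoffs[np.argmax(throughput)], throughput

c_star, _ = best_cutoff()
print("throughput-maximising cutoff:", c_star)
```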