Bayesian Distributed Stochastic Gradient Descent

Authors: Michael Teng, Frank Wood

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but sometimes also increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. (A minimal sketch of this straggler-cutoff update appears after the table.)
Researcher Affiliation | Academia | Michael Teng, Department of Engineering Sciences, University of Oxford, mteng@robots.ox.ac.uk; Frank Wood, Department of Computer Science, University of British Columbia, fwood@cs.ubc.ca
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | Yes | To test our model's ability to accurately predict joint worker runtimes, we perform experiments by training 4 different neural network architectures on one of two clusters of different architectures and sizes. [...] train a 2-layer CNN on MNIST classification. [...] training three neural network architectures for CIFAR10 classification: a Wide ResNet model (Zagoruyko and Komodakis, 2016) and 16 and 64 layer ResNets. (A sketch of how such a runtime model can drive the cutoff choice appears after the table.)
Dataset Splits | Yes | Figure 3b shows MNIST validation loss for model-based methods, Elfving and BDSGD, compared to naive synchronous (waiting for all workers) and asynchronous (Hogwild) approaches, where dashed vertical lines indicate the time at which the final iteration completed (all training methods perform the same number of mini-batch gradient updates). [...] and validation loss at a fixed wall-clock time (set to the wall-clock time at 50% of the total training time taken by Chen et al.'s method).
Hardware Specification | Yes | On one cluster comprised of four nodes of 40 logical Intel Xeon processors, we benchmark Elfving and BDSGD cutoff predictors against the fully synchronous and fully asynchronous SGD with a 158-worker model [...] On a large compute cluster, we use 32 68-core CPU nodes of a Cray XC40 supercomputer to compare 2175-worker BDSGD runs against the Chen et al. cutoff and naive methods on training three neural network architectures for CIFAR10 classification.
Software Dependencies | No | The paper mentions software components and concepts like 'deep neural net auto-regressors', 'variational autoencoder loss', and 'Adam optimizer', but it does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | No | The paper mentions general training parameters like 'learning rate' and 'batchsize' but does not provide specific hyperparameter values or detailed system-level training settings used in their experiments.
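
The Research Type row quotes the paper's central claim: waiting only for the fastest workers and discarding the stragglers' mini-batch gradients can improve both throughput and wall-clock convergence. Below is a minimal, self-contained simulation of that straggler-cutoff update, offered as an illustration rather than the authors' implementation; the toy least-squares objective, the log-normal runtime model, and the helper names (`run`, `minibatch_grad`) are all assumptions introduced here.

```python
# Minimal simulation of the straggler-cutoff idea: at every step the server
# waits only for the fastest `cutoff` of `n_workers` mini-batch gradients and
# never waits for the stragglers' contributions.
# The toy least-squares model and the log-normal runtime model are
# illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares regression, so gradients are cheap to compute.
d = 10
w_true = rng.normal(size=d)
X = rng.normal(size=(4096, d))
y = X @ w_true + 0.1 * rng.normal(size=4096)

def minibatch_grad(w, batch):
    xb, yb = X[batch], y[batch]
    return 2.0 * xb.T @ (xb @ w - yb) / len(batch)

def run(n_workers=32, cutoff=32, steps=200, lr=0.05, batch=64):
    """Synchronous SGD that aggregates only the first `cutoff` gradients."""
    w = np.zeros(d)
    wall_clock = 0.0
    for _ in range(steps):
        # Assumed runtime model: heavy-tailed per-worker compute times.
        runtimes = rng.lognormal(mean=0.0, sigma=0.5, size=n_workers)
        fastest = np.argsort(runtimes)[:cutoff]
        # The step finishes when the slowest *included* worker finishes.
        wall_clock += runtimes[fastest].max()
        grads = [minibatch_grad(w, rng.integers(0, len(X), batch))
                 for _ in fastest]
        w -= lr * np.mean(grads, axis=0)
    return wall_clock, np.mean((X @ w - y) ** 2)

for c in (32, 24, 16):  # full synchronisation vs. two straggler cutoffs
    t, loss = run(cutoff=c)
    print(f"cutoff={c:2d}  wall-clock={t:7.1f}  final loss={loss:.4f}")
```

With heavy-tailed worker runtimes, a cutoff below the full worker count trades a slightly noisier averaged gradient for a much shorter per-step wait, which is the idleness argument quoted above.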
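
The Open Datasets row quotes the paper's stated goal of predicting joint worker runtimes; BDSGD uses such a predictive model to choose its cutoff. The sketch below shows one plausible way to turn runtime predictions into a cutoff, by maximising expected gradients delivered per unit of wall-clock time via Monte Carlo estimates of order statistics. The correlated log-normal sampler merely stands in for the paper's learned deep auto-regressive runtime model, and `sample_joint_runtimes`, `best_cutoff`, and all parameters are hypothetical; the paper's exact objective may differ.

```python
# Illustration of choosing a straggler cutoff from a predictive model of joint
# worker runtimes: pick the cutoff that maximises expected gradients delivered
# per unit of wall-clock time. A correlated log-normal sampler stands in for a
# learned runtime model; everything here is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample_joint_runtimes(n_samples, n_workers=32):
    # Stand-in for a learned runtime model: correlated log-normal runtimes
    # (a shared "cluster load" factor plus per-worker noise).
    shared = rng.normal(scale=0.3, size=(n_samples, 1))
    own = rng.normal(scale=0.4, size=(n_samples, n_workers))
    return np.exp(shared + own)

def best_cutoff(n_workers=32, n_samples=10_000):
    runtimes = np.sort(sample_joint_runtimes(n_samples, n_workers), axis=1)
    # E[time to collect the fastest c gradients] = E[c-th order statistic].
    expected_step_time = runtimes.mean(axis=0)   # index c-1 corresponds to cutoff c
    cutoffs = np.arange(1, n_workers + 1)
    throughput = cutoffs / expected_step_time    # gradients per unit time
    return cutoffs[np.argmax(throughput)], throughput

c_star, _ = best_cutoff()
print("throughput-maximising cutoff:", c_star)
```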