Fault Tolerance in Iterative-Convergent Machine Learning

Authors: Aurick Qiao, Bryon Aragam, Bingjing Zhang, Eric Xing

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With our evaluation, we wish to (1) illustrate our rework cost bounds for different types of perturbations using practical ML models, (2) empirically measure the rework costs of a variety of models under the partial recovery and prioritized checkpoint strategies in SCAR, and (3) show that SCAR incurs near-optimal rework cost in a set of large-scale experiments.
Researcher Affiliation | Collaboration | 1 Petuum, Inc., Pittsburgh, Pennsylvania, USA; 2 Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; 3 Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for SCAR or the described methodology.
Open Datasets | Yes | Multinomial Logistic Regression (MLR) trained with stochastic (minibatch) gradient descent. We train MLR on the MNIST (LeCun et al., 1998) and Covertype (Dheeru & Karra Taniskidou, 2017) datasets. Matrix Factorization (MF) trained with alternating least squares (ALS). We train MF on the MovieLens (Harper & Konstan, 2015) and Jester (Goldberg et al., 2001) datasets. Latent Dirichlet Allocation (LDA) trained with collapsed Gibbs sampling (Liu, 1994). We train LDA on the 20 Newsgroups (Lang, 1995) and Reuters (Lewis et al., 2004) datasets.
Dataset Splits | No | The paper does not specify explicit training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for the datasets used.
Hardware Specification | Yes | We use four AWS i3.2xlarge instances to train MLR on the full 26GB Criteo (Juan et al., 2016) dataset, and LDA on a 12GB subset of the ClueWeb12 dataset (Gabrilovich et al., 2013).
Software Dependencies | No | The paper mentions software such as TensorFlow and PyTorch but does not provide specific version numbers for these or other ancillary software components, which are required for reproducibility.
Experiment Setup | Yes | For both MLR and LDA, we trigger a failure of 25% of parameters (corresponding to a single failed node in our 4-node cluster) after 7 epochs. We compare SCAR, which saves 1/8 of the highest-priority parameters every epoch, with traditional checkpointing, which saves all parameters every 8 epochs. (A toy sketch of these two checkpoint-and-recovery schedules follows the table.)
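The Experiment Setup row describes the two checkpoint schedules only at a high level, so the sketch below is a minimal toy simulation rather than the paper's SCAR implementation: it assumes a simple quadratic objective, uses the magnitude of the last update as a stand-in priority metric (the paper's priority definition may differ), splits parameters evenly across 4 nodes, fails one node (25% of parameters) after 7 epochs, and compares partial recovery from per-epoch prioritized partial checkpoints against a full rollback to checkpoints taken every 8 epochs. All identifiers are hypothetical.

```python
# Toy simulation of the checkpoint schedules in the "Experiment Setup" row.
# The quadratic objective and the "priority = |last update|" heuristic are
# illustrative assumptions, not SCAR's actual priority metric or recovery code.
import numpy as np

rng = np.random.default_rng(0)
D, NODES, EPOCHS, LR = 400, 4, 12, 0.1
target = rng.normal(size=D)                        # optimum of the toy objective
node_of = np.repeat(np.arange(NODES), D // NODES)  # parameter -> node assignment

def run(strategy):
    """Train for EPOCHS epochs, failing node 0 (25% of parameters) after epoch 7."""
    w = np.zeros(D)
    full_ckpt = np.zeros(D)      # last full checkpoint (traditional scheme)
    partial_ckpt = np.zeros(D)   # freshest per-parameter saved values (SCAR-like scheme)
    for epoch in range(EPOCHS):
        update = LR * (target - w)                 # gradient step on 0.5*||w - target||^2
        w += update
        if strategy == "traditional" and (epoch + 1) % 8 == 0:
            full_ckpt = w.copy()                   # save all parameters every 8 epochs
        if strategy == "scar":
            top = np.argsort(-np.abs(update))[: D // 8]
            partial_ckpt[top] = w[top]             # save the 1/8 highest-priority parameters
        if epoch == 6:                             # failure of node 0 after 7 epochs
            lost = node_of == 0
            if strategy == "traditional":
                w = full_ckpt.copy()               # roll every parameter back
            else:
                w[lost] = partial_ckpt[lost]       # partial recovery of lost parameters only
    return np.linalg.norm(w - target)

for s in ("traditional", "scar"):
    print(f"{s:11s}: distance to optimum after {EPOCHS} epochs = {run(s):.4f}")
```

In this toy run, the SCAR-like schedule keeps the surviving 75% of parameters untouched and restores the checkpointed lost parameters from values at most one epoch old, while the traditional schedule, failing just before its first full checkpoint, rolls everything back to the initial values; this mirrors, qualitatively, the rework-cost gap the paper's evaluation measures.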