Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML

Authors: Tehila Dahan, Kfir Y. Levy

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our methodology is rigorously validated through empirical and theoretical analysis, demonstrating its effectiveness in enhancing fault tolerance and optimizing performance in asynchronous ML systems. To evaluate the effectiveness of our proposed approach, we conducted experiments on the MNIST [LeCun et al., 2010] and CIFAR-10 [Krizhevsky et al., 2014] datasets, two recognized benchmarks in image classification tasks.
Researcher Affiliation | Academia | Tehila Dahan, Department of Electrical Engineering, Technion, Haifa, Israel, t.dahan@campus.technion.ac.il; Kfir Y. Levy, Department of Electrical Engineering, Technion, Haifa, Israel, kfirylevy@technion.ac.il
Pseudocode | Yes | Algorithm 1: Weighted Centered Trimmed Meta Aggregator (ω-CTMA); Algorithm 2: Asynchronous Robust µ2-SGD (an illustrative aggregation sketch follows the table below)
Open Source Code | Yes | For more details, please visit our GitHub repository: https://github.com/dahan198/asynchronous-fault-tolerant-ml
Open Datasets | Yes | We simulated over the MNIST [LeCun et al., 2010] and CIFAR-10 [Krizhevsky et al., 2014] datasets. The datasets were accessed through torchvision (version 0.16.2). MNIST Dataset. MNIST is a widely used benchmark dataset in the machine learning community, consisting of 70,000 grayscale images of handwritten digits (0-9) with a resolution of 28x28 pixels. The dataset is split into 60,000 training images and 10,000 test images. CIFAR-10 Dataset. CIFAR-10 is a widely recognized benchmark dataset in the machine learning community, containing 60,000 color images categorized into 10 different classes. Each image has a resolution of 32x32 pixels and represents objects such as airplanes, automobiles, birds, cats, and more. The dataset is evenly split into 50,000 training images and 10,000 test images. (A loading sketch follows the table below.)
Dataset Splits | No | The paper mentions training and test splits, but it does not explicitly detail a validation split or its size/percentage; it states only that the datasets are split into training and test images.
Hardware Specification | Yes | all computations were executed on an NVIDIA L40S GPU.
Software Dependencies | Yes | We employed a two-layer convolutional neural network architecture for both datasets, implemented using the PyTorch framework. The datasets were accessed through torchvision (version 0.16.2).
Experiment Setup | Yes | Optimization Setup. We optimized the cross-entropy loss across all experiments. For comparisons, we configured µ2-SGD with fixed parameters γ = 0.1 and β = 0.25. This was tested against Standard SGD and Momentum-based SGD, where the momentum parameter was set to β = 0.9 as recommended by Karimireddy et al. [2021].

Parameter | MNIST | CIFAR-10
Model Architecture | Conv(1,20,5), ReLU, MaxPool(2x2), Conv(20,50,5), ReLU, MaxPool(2x2), FC(800→50), BatchNorm, ReLU, FC(50→10) | Conv(3,20,5), ReLU, MaxPool(2x2), Conv(20,50,5), ReLU, MaxPool(2x2), FC(1250→50), BatchNorm, ReLU, FC(50→10)
Learning Rate | 0.01 | 0.01
Batch Size | 16 | 8
Data Processing & Augmentation | Normalize(mean=(0.1307,), std=(0.3081,)) | RandomCrop(size=32, padding=2), RandomHorizontalFlip(p=0.5), Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010))
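The Model Architecture row translates directly into PyTorch modules. Below is a minimal sketch of the MNIST variant (the CIFAR-10 variant differs only in its 3 input channels and the FC(1250→50) layer), together with the Standard SGD and Momentum-based SGD baselines configured as in the excerpt; µ2-SGD is not a stock PyTorch optimizer and is omitted here, see the authors' repository for their implementation.

```python
import torch
import torch.nn as nn

# MNIST variant; CIFAR-10 uses ConvNet(in_channels=3, fc_in=1250) instead.
class ConvNet(nn.Module):
    def __init__(self, in_channels=1, fc_in=800):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 20, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Linear(fc_in, 50), nn.BatchNorm1d(50), nn.ReLU(),
            nn.Linear(50, 10))

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = ConvNet()                  # MNIST: learning rate 0.01, batch size 16
criterion = nn.CrossEntropyLoss()  # cross-entropy loss across all experiments
# Baseline configurations from the excerpt (each baseline would train its own copy of the model).
sgd = torch.optim.SGD(model.parameters(), lr=0.01)
momentum_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # β = 0.9
```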
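The Open Datasets row reports access through torchvision 0.16.2, so loading reduces to the standard torchvision calls. A minimal sketch, with the normalization and augmentation constants taken from the Experiment Setup row (the `./data` root path is an arbitrary choice):

```python
import torchvision
import torchvision.transforms as T

# MNIST: 60,000 train / 10,000 test grayscale 28x28 digit images
mnist_train = torchvision.datasets.MNIST(
    root="./data", train=True, download=True,
    transform=T.Compose([T.ToTensor(), T.Normalize((0.1307,), (0.3081,))]))

# CIFAR-10: 50,000 train / 10,000 test 32x32 color images in 10 classes
cifar_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=T.Compose([
        T.RandomCrop(32, padding=2),
        T.RandomHorizontalFlip(p=0.5),
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))]))
```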
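The Pseudocode row names Algorithm 1 (ω-CTMA) without reproducing it. The sketch below illustrates a weighted centered-trimming aggregation step under stated assumptions: a precomputed robust center estimate, per-worker weights, and a trimming radius are taken as given, and the function name `omega_ctma_sketch` and the radius rule are illustrative choices, not the authors' exact Algorithm 1.

```python
import torch

def omega_ctma_sketch(messages, weights, center, radius):
    """Illustrative weighted centered-trimming step (an assumption, not the
    paper's exact omega-CTMA). Messages farther than `radius` from the robust
    center estimate are replaced by the center; the rest are kept, and a
    weight-averaged aggregate is returned."""
    weights = weights / weights.sum()                        # normalize worker weights
    trimmed = [m if (m - center).norm() <= radius else center
               for m in messages]                            # trim outlying messages
    return sum(w * m for w, m in zip(weights, trimmed))      # weighted aggregate
```

Here the weights are supplied by the caller; in an asynchronous run they would presumably encode how much each worker contributes, which is the role a weighted variant plays relative to an unweighted one (an assumption of this sketch).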